The Handbook of Speech Perception


that less weight be placed on the McGurk effect in evaluating multisensory integration. Evaluation of integration may be better served by measures of the perceptual super‐additivity of visual and audio (e.g. in noise) streams (e.g. Alsius, Paré, & Munhall, 2017; Irwin & DiBlasi, 2017; Remez, Beltrone, & Willimetz, 2017); by influences on speech‐production responses (Gentilucci & Cattaneo, 2005; and see Sato et al., 2010); and by neurophysiological responses (e.g. Skipper et al., 2007). Such methods may well be more stable, valid, and representative indices of integration than the McGurk effect.
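
      To make the notion of a super‐additivity measure concrete, the sketch below is a toy illustration (not a measure taken from the studies cited above): it compares audiovisual identification accuracy in noise against a probability‐summation baseline derived from the unimodal accuracies. The function names and the accuracy values are assumptions.

```python
# Illustrative sketch (not from the cited studies): quantify audiovisual
# "super-additivity" as identification accuracy beyond what independent use
# of the two streams would predict.

def probability_summation(p_audio: float, p_visual: float) -> float:
    """Accuracy expected if the two streams were used independently."""
    return p_audio + p_visual - p_audio * p_visual

def superadditivity_gain(p_audio: float, p_visual: float, p_av: float) -> float:
    """Positive values indicate AV performance beyond the independent-streams baseline."""
    return p_av - probability_summation(p_audio, p_visual)

if __name__ == "__main__":
    # Hypothetical identification accuracies for speech in noise.
    p_a, p_v, p_av = 0.40, 0.20, 0.75
    print(f"Expected from independent streams: {probability_summation(p_a, p_v):.2f}")
    print(f"Super-additive gain: {superadditivity_gain(p_a, p_v, p_av):.2f}")
```

On this simple view, any positive gain over the baseline would count as super‐additive benefit from combining the streams.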

      The question of where in the speech function the modal streams integrate (merge) continues to be one of the most studied in the multisensory literature. Since 2005, much of this research has used neurophysiological methods. After the aforementioned fMRI report by Calvert and her colleagues (1997; see also Pekkola et al., 2005), numerous studies have shown visual speech activation of the auditory cortex using other technologies, for example, functional near‐infrared spectroscopy (fNIR; van de Rijt et al., 2016); electroencephalography (EEG; Callan et al., 2001; Besle et al., 2004); intracranial EEG (ECoG; e.g. Besle et al., 2008); and magnetoencephalography (MEG; Arnal et al., 2009; for a review, see Rosenblum, Dorsi, & Dias, 2016). More recent evidence shows that visual speech can modulate neural areas considered to be further upstream, including the auditory brainstem (Musacchia et al., 2006), one of the earliest locations at which direct visual modulation could occur. There is even evidence of visual speech modulation of cochlear functioning (otoacoustic emissions; Namasivayam et al., 2015). While visual influences on such peripheral auditory mechanisms are likely based on feedback from downstream areas, the fact that they occur at all indicates the importance of visual input to the speech function.

      Other neurophysiological findings suggest that the integration of the streams also happens early. A recent EEG study revealed that N1 auditory‐evoked potentials (known to reflect primary auditory cortex activity) for visually induced (McGurk) fa and ba syllables (auditory ba + visual fa and auditory fa + visual ba, respectively) resemble the N1 responses for the corresponding auditory‐alone syllables (Shahin et al., 2018; and see van Wassenhove, Grant, & Poeppel, 2005). The degree of resemblance was greater for individuals whose identification responses showed stronger visual influences, suggesting that this modulated auditory cortex activity (reflected in N1) corresponds to an integrated perceived segment. This finding is less consistent with the alternative model in which separate unimodal analyses are first conducted in the primary cortices, with their outcomes then combined at a multisensory integrator such as the posterior STS (pSTS; e.g. Beauchamp et al., 2004).
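
      The logic of that analysis can be sketched informally as follows; the data shapes, similarity metric, and simulated values below are assumptions for illustration, not the published method. The idea is to ask, per participant, how closely the McGurk‐condition N1 waveform resembles the auditory‐alone N1 for the perceived syllable, and whether that resemblance tracks the participant's behavioral visual influence.

```python
# Illustrative sketch (assumed data shapes and metric; not the published analysis):
# relate how closely each participant's McGurk-condition N1 resembles their
# auditory-alone N1 (for the perceived syllable) to how visually influenced
# their identification responses were.
import numpy as np

def n1_resemblance(erp_mcgurk: np.ndarray, erp_audio_alone: np.ndarray) -> np.ndarray:
    """Pearson correlation between the two N1 waveforms, computed per participant.

    Both arrays are assumed to be (n_participants, n_timepoints) averages over
    an N1 time window at a fronto-central electrode.
    """
    a = erp_mcgurk - erp_mcgurk.mean(axis=1, keepdims=True)
    b = erp_audio_alone - erp_audio_alone.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_subj, n_time = 20, 100
    erp_auditory = rng.standard_normal((n_subj, n_time))
    # Simulated McGurk-condition ERPs that resemble the auditory-alone ERPs
    # in proportion to each participant's (simulated) visual influence.
    visual_influence = rng.uniform(0, 1, n_subj)  # proportion of visually influenced responses
    erp_mcgurk = (visual_influence[:, None] * erp_auditory
                  + (1 - visual_influence[:, None]) * rng.standard_normal((n_subj, n_time)))
    resemblance = n1_resemblance(erp_mcgurk, erp_auditory)
    # Across participants: does N1 resemblance track behavioral visual influence?
    print(f"r(resemblance, visual influence) = {np.corrcoef(resemblance, visual_influence)[0, 1]:.2f}")
```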

      The behavioral research also continues to show evidence of early crossmodal influences (for a review, see Rosenblum, Dorsi, & Dias, 2016). Evidence suggests that visual influences likely occur before auditory feature extraction (e.g. Brancazio, Miller, & Paré, 2003; Fowler, Brown, & Mann, 2000; Green & Gerdeman, 1995; Green & Kuhl, 1989; Green & Miller, 1985; Green & Norrix, 2001; Schwartz, Berthommier, & Savariaux, 2004). Other research shows that information in one modality is able to facilitate perception in the other even before the information is usable – and sometimes even detectable – on its own (e.g. Plass et al., 2014). For example, Plass and his colleagues (2014) used flash suppression to render visually presented articulating faces (consciously) undetectable. Still, if these undetected faces were presented with auditory speech that was consistent and synchronized with the visible articulation, then subjects were faster at recognizing that auditory speech. This suggests that useful crossmodal influences can occur even without awareness of information in one of the modalities.

      Other examples of the extreme super‐additive nature of speech integration have been shown in the context of auditory speech detection (Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004; Palmer & Ramsey, 2012) and identification (Schwartz, Berthommier, & Savariaux, 2004), as well as audiovisual speech identification (Eskelund, Tuomainen, & Andersen, 2011; Rosen, Fourcin, & Moore, 1981). Much of this research has been interpreted to suggest that, even without supporting a clear (conscious) phonetic determination of its own, each modality can help the perceiver attend to critical information in the other modality through analogous patterns of temporal change in the two signals. These crossmodal correspondences are thought to be influential at an especially early stage (before feature extraction), serving as a “bimodal coherence‐masking protection” against everyday signal degradation (e.g. Grant & Seitz, 2000; Kim & Davis, 2004; Schwartz, Berthommier, & Savariaux, 2004; see also Gordon, 1997). The impressive utility of these crossmodal correspondences will also help motivate the theoretical position proposed later in this chapter.
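
      One way to picture these analogous patterns of temporal change is as shared slow modulation across the two signals. The sketch below is an illustration under assumed signal names, sampling rates, and toy data (not an analysis from the cited studies): it correlates a crude acoustic amplitude envelope with a lip‐aperture time series sampled at the video frame rate.

```python
# Illustrative sketch (assumed signal names and rates): crossmodal correspondence
# as the correlation between the acoustic amplitude envelope and a lip-aperture track.
import numpy as np

def amplitude_envelope(audio: np.ndarray, frame_len: int) -> np.ndarray:
    """Crude RMS envelope: one value per non-overlapping frame of the audio."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def crossmodal_correspondence(audio: np.ndarray, lip_aperture: np.ndarray,
                              audio_rate: int, video_rate: int) -> float:
    """Pearson correlation between the audio envelope (resampled to the video
    frame rate) and the lip-aperture time series."""
    env = amplitude_envelope(audio, frame_len=audio_rate // video_rate)
    n = min(len(env), len(lip_aperture))
    return float(np.corrcoef(env[:n], lip_aperture[:n])[0, 1])

if __name__ == "__main__":
    # Hypothetical 2-second example: a ~3 Hz, syllable-like modulation drives both signals.
    audio_rate, video_rate, dur = 16_000, 50, 2.0
    t_audio = np.arange(int(audio_rate * dur)) / audio_rate
    t_video = np.arange(int(video_rate * dur)) / video_rate
    modulation = 0.5 * (1 + np.sin(2 * np.pi * 3 * t_audio))
    audio = modulation * np.random.default_rng(1).standard_normal(len(t_audio))
    lips = 0.5 * (1 + np.sin(2 * np.pi * 3 * t_video))
    print(f"audio-lip correlation: {crossmodal_correspondence(audio, lips, audio_rate, video_rate):.2f}")
```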

      However, other interpretations of these results have been offered that are consistent with early integration (Brancazio, 2004; Rosenblum, 2008). It may be that lexicality and sentence context do not bear on the likelihood of integration, but instead on how the post‐integrated segment is categorized. As stated, it is likely that syllables perceived from conflicting audiovisual information are less canonical than those based on congruent (or audio‐alone) information. This likely makes those syllables less robust, even when they are identified as visually influenced segments. It could be that, despite incongruent segments being fully integrated, the resultant perceived segment is more susceptible to contextual (e.g. lexical) influences than audiovisually congruent (and auditory‐alone) segments. This is certainly known to be the case for less canonical, more ambiguous audio‐alone segments, as demonstrated by the Ganong effect: an ambiguous segment heard equally as k or g in isolation will be heard as the former when placed before the syllable iss, but as the latter when placed before ift (Connine & Clifton, 1987; Ganong, 1980). If the same is true of incongruent audiovisual segments, then lexical context may not bear on audiovisual integration as such, but on the categorization of the post‐integrated (and less canonical) segment (e.g. Brancazio, 2004).
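
      The susceptibility argument can be illustrated with a toy categorization rule (not a model drawn from the cited papers): if responses follow a logistic function of perceptual evidence plus a lexical bias, the same bias shifts categorization substantially when the evidence is ambiguous, but hardly at all when the segment is canonical. The evidence and bias values below are arbitrary assumptions.

```python
# Toy illustration: why lexical context shifts ambiguous segments more than canonical ones.
import math

def p_report_g(perceptual_evidence: float, lexical_bias: float) -> float:
    """Probability of categorizing the segment as /g/.

    perceptual_evidence: signed evidence for /g/ over /k/ (0 = fully ambiguous).
    lexical_bias: contextual pull toward /g/ (e.g. positive before "ift", negative before "iss").
    """
    return 1.0 / (1.0 + math.exp(-(perceptual_evidence + lexical_bias)))

if __name__ == "__main__":
    for label, evidence in [("canonical /k/", -4.0), ("ambiguous", 0.0), ("canonical /g/", 4.0)]:
        # Size of the categorization shift produced by the same lexical bias.
        shift = p_report_g(evidence, +1.5) - p_report_g(evidence, -1.5)
        print(f"{label:14s} lexical shift in P(/g/): {shift:.2f}")
```

Run on these toy values, the shift is large only for the ambiguous case, which parallels the claim that less canonical, post‐integration segments should show the strongest lexical effects.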

      Still, other recent evidence has been interpreted as showing that a semantic analysis is conducted on the individual streams before integration is fully complete (see also Bernstein, Auer, & Moore, 2004). Ostrand and her colleagues (2016) present data showing that, despite a McGurk word being perceived as visually influenced (e.g. audio bait + visual date = heard date), the auditory component of the stimulus provides stronger priming of semantically related auditory words (audio bait + visual date primes worm more strongly than it primes calendar). This finding could suggest