2017; Tiippana, Andersen, & Sams, 2004; see also Munhall et al., 2009). Unfortunately, relatively few of these studies have also tested unimodal conditions to determine whether these distractors might simply reduce detection of the requisite unimodal information. If, for example, less visual information can be extracted during distraction (of any type), then a reduced McGurk effect would likely be observed. In the few studies that have examined visual-alone performance under distraction, the tests were probably not sensitive enough to detect such a reduction, given the especially low baseline performance of straight lipreading (Alsius et al., 2005; Alsius, Navarra, & Soto‐Faraco, 2007; for a review of this argument, see Rosenblum, 2019). Thus, to date, it is unclear whether outside attention can truly penetrate the speech integration function or instead simply disrupts extraction of the visual information needed for a McGurk effect. Moreover, the McGurk effect itself may not constitute a thorough test of speech integration.

      Consequently, the effect has become a method for establishing the conditions under which integration occurs. Measurements of the effect’s strength have been used to determine how multisensory speech perception is affected by individual differences (see Strand et al., 2014, for a review), attention, and generalized face processing (e.g. Eskelund, MacDonald, & Andersen, 2015; Rosenblum, Yakel, & Green, 2000). The effect has also been used to determine where in the perceptual and neurophysiological process integration occurs and whether integration is complete (for discussions of these topics, see Brancazio & Miller, 2005).

      However, a number of researchers have recently questioned whether the McGurk effect should be used as a primary test of multisensory integration (Alsius, Paré, & Munhall, 2017; Remez, Beltrone, & Willimetz, 2017; Rosenblum, 2019; Irwin & DiBlasi, 2017; Brown et al., 2018). There are multiple reasons for these concerns. First, there is wide variability in most aspects of McGurk methodology (for a review, see Alsius, Paré, & Munhall, 2017). Most obviously, the specific talkers used to create the stimuli usually vary from project to project. The dubbing procedure – specifically, how the audio and visual components are aligned – also varies across laboratories. Studies also vary in which syllables are used, as well as in the type of McGurk effect tested (fusion vs. visual dominance). Procedurally, the tasks (e.g. open response vs. forced choice), stimulus ordering (fully randomized vs. blocked by modality), and the control condition chosen (e.g. audio‐alone vs. audiovisually congruent syllables) vary across studies (Alsius, Paré, & Munhall, 2017). This extreme methodological variability may account for the wide range of McGurk effect strengths reported across the literature. Finding evidence of the effect under such different conditions does speak to its durability. However, the methodological variability makes it difficult to know whether influences on the effect’s strength are attributable to the variable in question (e.g. facial inversion) or to some superfluous characteristic of idiosyncratic stimuli and/or tasks.

      Another concern about the McGurk effect is whether it is truly representative of natural (nonillusory) multisensory perception (Alsius, Paré, & Munhall, 2017; Remez, Beltrone, & Willimetz, 2017). It could very well be that different perceptual and neurophysiological resources are recruited when integrating discrepant rather than congruent audiovisual components. In fact, it has long been known that McGurk‐effect syllables (e.g. audio ba + visual va = va) are less compelling and take longer to identify (Brancazio, 2004; Brancazio, Best, & Fowler, 2006; Green & Kuhl, 1991; Jerger et al., 2017; Massaro & Ferguson, 1993; Rosenblum & Saldaña, 1992) than analogous audiovisually congruent syllables (audio va + visual va = va). This is true even when the McGurk syllables are identified with a frequency comparable to that of the congruent syllables (98 percent va; Rosenblum & Saldaña, 1992). Relatedly, there is evidence that, when spatial and temporal offsets are applied to the audio and visual components, McGurk stimuli are more readily perceived as separate components than are audiovisually congruent syllables (e.g. Bishop & Miller, 2011; van Wassenhove, Grant, & Poeppel, 2007).

      Additional evidence that the McGurk effect may not be representative of normal integration comes from intersubject differences. There is little evidence for a correlation between a subject’s likelihood of displaying the McGurk effect and the benefit that subject gains from visual speech in enhancing noisy auditory speech (at least in normal-hearing subjects; e.g. Van Engen, Xie, & Chandrasekaran, 2016; but see Grant & Seitz, 1998). Relatedly, the relationship between straight lip‐reading skill and susceptibility to the McGurk effect is weak at best (Cienkowski & Carney, 2002; Strand et al., 2014; Wilson et al., 2016; Massaro et al., 1986).

      A particularly troubling concern regarding the McGurk effect is evidence that its failure does not mean that integration has not occurred (Alsius, Paré, & Munhall, 2017; Rosenblum, 2019). Multiple studies have shown that when the McGurk effect seems to fail and a subject reports hearing just the auditory segment (e.g. auditory /b/ + visual /g/ = perceived /b/), influences of the visual, and perhaps integrated, segment are still present in the gestural nuances of the subject’s spoken response (Gentilucci & Cattaneo, 2005; Sato et al., 2010; see Rosenblum, 2019, for further discussion). In another example, Brancazio and Miller (2005) showed that in instances when a visual /ti/ failed to change identification of an audible /pi/, a simultaneous manipulation of the speaking rate of the visible /ti/ did influence the voice‐onset time perceived in the /pi/ (see also Green & Miller, 1985). Thus, information for voice‐onset time was integrated across the visual and audible syllables even when the McGurk effect failed to change the identification of the /pi/.

      It is unclear why featural integration can still occur in the face of a failed McGurk effect (Rosenblum, 2019; Alsius, Paré, & Munhall, 2017). It could be that standard audiovisual segment integration does occur in these instances, but the resultant segment does not change enough to be categorized differently. As stated, percepts based on McGurk stimuli are less robust than those based on audiovisually congruent (or audio‐alone) stimuli. It could be that some integration almost always occurs for McGurk segments, but the less canonical integrated segment sometimes induces a phonetic categorization that is the same as that of the auditory‐alone segment. Regardless, the fact that audiovisual integration of some type can occur when the McGurk effect appears to fail forces a reconsideration of the effect as a primary test of integration.