
The Handbook of Speech Perception



before it is merged with the visual component and provides stronger priming than the visible word. If this contention were true, it would mean that the channels are not fully integrated until a substantial amount of processing has occurred within the individual channels.

      In sum, many of the new results from behavioral, and especially neurophysiological, research suggest that the audio and visual streams are merged as early as can currently be observed (but see Bernstein, Auer, & Moore, 2004). In the previous version of this chapter we argued that this fact, along with the ubiquity and automaticity of multisensory speech, suggests that the speech function is designed around multisensory input (Rosenblum, 2005). We further argued that the function may make use of the fact that there is a common informational form across the modalities. This contention will be addressed in the final section of this chapter.

      The notion that the speech mechanism may be sensitive to a form of information that is not tied to a specific sensory modality has been discussed for over three decades (e.g. Summerfield, 1987). This construal of multisensory speech information has alternatively been referred to as amodal, modality‐neutral (e.g. Rosenblum, 2005), and supramodal (Fowler, 2004; Rosenblum et al., 2016, 2017). The theory suggests a speech mechanism that is sensitive to a form of information that can be instantiated in multiple modalities. Such a mechanism would not need to translate information across modality‐specific codes, nor to carry out a formal process of sensory integration (or merging) as such. From this perspective, the integration is a characteristic of the relevant information itself. Of course, the energetic details of the (light, sound, tactile‐mechanical) input and their superficial receptor reactions are necessarily distinct. But the deeper speech function may act to register the phonetically relevant higher‐order patterns of energy that can be functionally the same across modalities.

      The supramodal theory has been motivated by the characteristics of multisensory speech discussed earlier, including: (1) neurophysiological and behavioral evidence for the automaticity and ubiquity of multisensory speech; (2) neurophysiological evidence for a speech mechanism sensitive to multiple sensory forms; (3) neurophysiological and behavioral evidence for integration occurring at the earliest observable stage; and (4) informational analyses showing a surprisingly close correlation between optic and acoustic informational variables for a given articulatory event. The theory is consistent with Carol Fowler’s direct approach to speech perception (e.g. Fowler, 1986, 2010), and with James Gibson’s theory of multisensory perception (Gibson, 1966, 1979; and see Stoffregen & Bardy, 2001). The theory is also consistent with the task‐machine and metamodal theories of general multisensory perception, which argue that function and task, rather than sensory system, are the guiding principles of the perceptual brain (e.g. Pascual‐Leone & Hamilton, 2001; Reich, Maidenbaum, & Amedi, 2012; Ricciardi et al., 2014; Striem‐Amit et al., 2011; see also Fowler, 2004; Rosenblum, 2013; Rosenblum, Dias, & Dorsi, 2017).

      Summerfield (1987) was the first to suggest that the informational form for certain articulatory actions can be construed as the same across vision and audition. As an intuitive example, he suggested that the higher‐order information for a repetitive syllable would be the same in sound and light. Consider a speaker repetitively articulating the syllable /ma/. For hearing, a repetitive oscillation of the amplitude and spectral structure of the acoustic signal would be lawfully linked to the repetitive movements of the lips, jaw, and tongue. For sight, a repetitive restructuring of the light reflecting from the face would also be lawfully linked to the same movements. While the energetic details of the information differ across modalities, the more abstract repetitive informational restructuring occurs in both modalities in the same oscillatory manner, with the same time course, so as to be specific to the articulatory movements. Thus, repetitive informational restructuring could be considered supramodal information – available in both the light and the sound – that acts to specify a speech event of repetitive articulation. A speech mechanism sensitive to this form of supramodal information would function without regard to the sensory details specific to each modality: the relevant form of information exists in the same way (abstractly defined) in both modalities. In this sense, a speech function that could pick up on this abstract form of information in multiple modalities would not require integration or translation of the information across modalities.
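
      As a purely illustrative sketch (not part of the chapter), the kind of analysis implied by Summerfield’s example can be made concrete in a few lines of code. The parameters below (the roughly 3 Hz syllable rate, the sinusoidal envelope and lip‐aperture shapes, and the noise levels) are assumptions chosen for illustration, not measurements.

```python
# Hypothetical illustration: quantifying the shared oscillatory structure
# in Summerfield's repetitive /ma/ example. All parameters are assumed.
import numpy as np

rng = np.random.default_rng(0)
fs = 100.0                          # samples per second for both streams
t = np.arange(0.0, 4.0, 1.0 / fs)   # 4 s of repetitive /ma/ articulation
syllable_rate = 3.0                 # assumed ~3 syllables per second

# Acoustic stream: the amplitude envelope rises as the mouth opens for /a/.
acoustic_envelope = 0.5 * (1.0 + np.sin(2 * np.pi * syllable_rate * t))
acoustic_envelope += 0.05 * rng.standard_normal(t.size)

# Visual stream: lip aperture (in mm) oscillates with the same articulation.
lip_aperture = 10.0 + 8.0 * np.sin(2 * np.pi * syllable_rate * t)
lip_aperture += 0.5 * rng.standard_normal(t.size)

# The energetic details (units, scale) differ across modalities, but once
# each signal is standardized, the higher-order oscillatory pattern remains.
def zscore(x):
    return (x - x.mean()) / x.std()

r = np.corrcoef(zscore(acoustic_envelope), zscore(lip_aperture))[0, 1]
print(f"Cross-modal correlation of oscillatory structure: r = {r:.2f}")
```

      A real analysis of this kind would, of course, use measured acoustic envelopes and tracked facial kinematics rather than synthetic sinusoids, as in the correlational studies discussed below (e.g. Chandrasekaran et al., 2009).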

      Summerfield (1987) offered other examples of supramodal information, such as how quantal changes in articulation (e.g. bilabial contact to no contact) and reversals in articulation (e.g. during articulation of a consonant–vowel–consonant such as /wew/) would be accompanied by corresponding quantal and reversal changes in the acoustic and optic structure.

      Other recent research has determined that some of the strongest correlations across audible and visible signals lie in the acoustic range of 2–3 kHz (Chandrasekaran et al., 2009). This may seem unintuitive because it is within this range that the presumably less visible articulatory movements of the tongue and pharynx play their largest role in sculpting the sound. However, the configurations of these articulators were shown to systematically influence subtle visible mouth movements. This fact suggests that there is a class of visible information that strongly correlates with the acoustic information shaped by the internal articulators. In fact, visual speech research has shown that the presumably “hidden” articulatory dimensions (e.g. lexical tone,