The Handbook of Speech Perception



      OIWI PARKER JONES1 AND JAN W. H. SCHNUPP2

      1 University of Oxford, United Kingdom

      2 City University of Hong Kong, Hong Kong

      In this chapter, we provide a brief overview of how the brain’s auditory system represents speech. The topic is vast: many decades of research have generated several books’ worth of insight into this fascinating question, and getting up close and personal with the subject matter requires a fair amount of background knowledge in neuroanatomy and physiology, as well as in acoustics and the linguistic sciences. Providing a reasonably comprehensive overview of the topic that is accessible to a wide readership, within a short chapter, is a near‐impossible task, and we apologize in advance for the shortcomings that this chapter will inevitably have. With these caveats, and without further ado, let us jump right in and begin by examining the question: what is there to ‘represent’ in a speech signal?

      Readers of this volume are likely to be well aware that extracting such higher‐order features from speech signals is a difficult and intricate process. Once the physical aspects of the acoustic waveform are encoded, phonetic properties such as formant frequencies, voicing, and voice pitch must be inferred, interpreted, and classified in a context‐dependent manner, which in turn facilitates the creation of a semantic representation of speech. In the auditory brain, this occurs along a processing hierarchy: the lowest levels of the auditory nervous system – the inner ear, auditory nerve fibers, and brainstem – encode the physical attributes of the sound and compute what may be described as low‐level features, which are then passed on via the midbrain and the thalamus toward an extensive network of auditory and multisensory cortical areas, whose task it is to form phonetic and semantic representations. As this chapter progresses, we will look in some detail at this progressive transformation: from an initially largely acoustic representation of speech sounds in the auditory nerve, brainstem, midbrain, and primary auditory cortex, to an increasingly linguistic feature representation in a part of the brain called the superior temporal gyrus, and finally to semantic representations in brain areas stretching well beyond those classically thought of as auditory structures.
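      To make the notion of low‐level features concrete, the sketch below shows one conventional way two of them – voice pitch and formant frequencies – can be computed from a single frame of waveform: pitch from the autocorrelation function, and formants from an all‐pole (linear predictive coding, LPC) model. This is our own illustrative example, not an analysis from the chapter; the synthetic vowel, filter settings, and search ranges are assumptions chosen only to keep the sketch self‐contained.

```python
# A minimal, self-contained sketch (our own, for illustration only):
# estimate voice pitch by autocorrelation and formant frequencies by
# linear predictive coding (LPC) from one synthetic 30 ms vowel frame.
import numpy as np
from scipy.linalg import solve, toeplitz
from scipy.signal import lfilter

fs = 16000                                    # sampling rate (Hz)
n = int(0.03 * fs)                            # one 30 ms analysis frame

# Crude vowel: a 120 Hz glottal pulse train passed through two
# vocal-tract resonances ("formants") near 700 Hz and 1200 Hz.
f0_true = 120.0
source = np.zeros(n)
source[::int(fs / f0_true)] = 1.0

def resonator(x, freq, bw=100.0):
    """Second-order all-pole resonance at `freq` Hz, bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * freq / fs
    return lfilter([1.0], [1.0, -2.0 * r * np.cos(theta), r * r], x)

frame = resonator(resonator(source, 700.0), 1200.0)

# Pitch: the autocorrelation of a voiced frame peaks at the glottal
# period; search only lags corresponding to a plausible 60-400 Hz range.
ac = np.correlate(frame, frame, mode="full")[n - 1:]
lo, hi = int(fs / 400), int(fs / 60)
period = lo + int(np.argmax(ac[lo:hi]))
print("estimated F0: %.1f Hz" % (fs / period))

# Formants: solve the Yule-Walker equations for an all-pole (LPC) model,
# then read resonance frequencies off the angles of the filter's poles.
order = 8
R = ac[:order + 1] / ac[0]
a = solve(toeplitz(R[:order]), R[1:order + 1])
poles = np.roots(np.concatenate(([1.0], -a)))
poles = poles[np.imag(poles) > 0]             # one pole per conjugate pair
formants = np.sort(np.angle(poles)) * fs / (2.0 * np.pi)
print("estimated formants (Hz):", np.round(formants))
```

      Real speech front ends apply this kind of analysis frame by frame; as the text emphasizes, the resulting features must then still be interpreted and classified in context.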

      While it is apt to think of this neural speech‐processing stream as hierarchical, it would nevertheless be wrong to think of it as an entirely feed‐forward process. It is well known that, for each set of ascending nerve fibers carrying auditory signals from the inner ear to the brainstem, from brainstem to midbrain, from midbrain to thalamus, and from thalamus to cortex, there is a parallel descending pathway running from cortex back to thalamus, midbrain, and brainstem, and all the way back to the ear. These descending pathways are thought to carry feedback signals that focus attention and that exploit the fact that the rules of language make the temporal evolution of speech sounds partly predictable; such predictions can facilitate hearing speech in noise, or tune the ear to the voice or dialect of a particular speaker.
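      As a toy illustration of how such predictions can help (our own numerical example, not a model proposed in the chapter), consider combining a noisy acoustic likelihood over candidate phonemes with a context‐derived prior, in the spirit of Bayes’ rule: the posterior is proportional to likelihood times prior, so a strong linguistic expectation can tip the balance when the acoustics alone are ambiguous.

```python
# Toy example (our own): a top-down linguistic prediction rescuing a
# phoneme decision that the noisy acoustics alone would get wrong.
import numpy as np

phonemes = ["b", "p", "d", "t"]

# Acoustic evidence degraded by noise: /p/ and /b/ are nearly tied.
likelihood = np.array([0.28, 0.30, 0.21, 0.21])

# Hypothetical context prior, e.g. after "a slice of ...", where
# "bread" makes /b/ far more expected than the alternatives.
prior = np.array([0.70, 0.10, 0.10, 0.10])

posterior = likelihood * prior
posterior /= posterior.sum()                  # posterior ∝ likelihood × prior

print("acoustics alone choose:", phonemes[int(np.argmax(likelihood))])  # 'p'
print("acoustics + prediction:", phonemes[int(np.argmax(posterior))])   # 'b'
```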