speakers, there was no significant auditory suppression, but there was a positive effect between pSTG and PMC, consistent with the idea of error feedback. The results suggest that PMC sends signal‐canceling, top‐down predictions to aSTG and pSTG. These top‐down predictions are stronger if you are a native speaker and more confident about which speech sounds you produce. In nonnative speakers, the top‐down predictions canceled less of the auditory input, and a bottom‐up learning signal (“error”) was fed back from the pSTG to the PMC. Interestingly, as the nonnative speakers became more proficient, the learning signals decreased, so that the most highly proficient nonnative speakers were indistinguishable from native speakers in terms of error feedback.

      The example of auditory suppression argues for a systems‐level view of speech comprehension that includes both auditory and premotor regions of the brain. Theoretically, we might think of these regions as being arranged in a functional hierarchy, with PMC located above both aSTG and pSTG. Top‐down predictions may thus be said to descend from PMC to aSTG and pSTG, while bottom‐up errors percolate in the opposite direction, from pSTG to PMC. We note that the framework used to interpret the auditory suppression results, predictive coding, subtly inverts the view that perceptual systems in the brain passively extract knowledge from the environment; instead, it proposes that these systems are actively trying to predict their sense experiences (Ballard, Hinton, & Sejnowski, 1983; Mumford, 1992; Kawato, Hayakawa, & Inui, 1993; Dayan et al., 1995; Rao & Ballard, 1999; Friston & Kiebel, 2009). In a foundational sense, predictive coding frames the brain as a forecasting machine, which has evolved to minimize surprises and to anticipate, and not merely react to, events in the world (Wolpert, Ghahramani, & Flanagan, 2001). This is not necessarily to say that what it means to be a person is to be a prediction machine, but rather to conjecture that perceptual systems in our brains, at least sometimes, predict sense experiences.
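
      To make the descending predictions and ascending errors concrete, here is a minimal numerical sketch of predictive coding in the spirit of Rao and Ballard (1999). The two‐layer setup, the dimensions, and the learning rates are illustrative assumptions rather than details of any study discussed in this chapter: a higher layer (standing in, loosely, for PMC) sends a top‐down prediction of activity in a lower layer (standing in for pSTG), and only the residual error is passed back up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a lower "sensory" layer and a higher "predicting"
# layer. Purely illustrative, not fit to any neural data.
n_sensory, n_latent = 16, 4

W = rng.normal(scale=0.3, size=(n_sensory, n_latent))  # top-down (generative) weights
y = rng.normal(size=n_sensory)                         # incoming sensory pattern
r = np.zeros(n_latent)                                 # higher-level estimate

# For brevity, inference and learning are interleaved in one loop.
lr_r, lr_W = 0.1, 0.01
for step in range(200):
    prediction = W @ r               # top-down prediction descends the hierarchy
    error = y - prediction           # only the residual "error" ascends
    r += lr_r * (W.T @ error)        # inference: adjust the estimate to cancel error
    W += lr_W * np.outer(error, r)   # learning: slow weight change driven by residuals

# A well-trained model cancels most of its input.
print(f"remaining error norm: {np.linalg.norm(y - W @ r):.4f}")
```

      The better the model, the more of the input it cancels, which is loosely analogous to the stronger auditory suppression observed in native speakers.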

       Temporal prediction

      The importance of prediction as a theme and as a hypothetical explanation for neural function also goes beyond explicit modeling in neural networks. We can invoke the idea of temporal prediction even when we do not know about the underlying connectivity patterns. Speech, for example, does not consist of a static set of phonemes; rather, speech is a continuous sequence of events, such that hearing part of the sequence gives you information about other parts that you have yet to hear. In phonology the sequential dependency of phonemes is called phonotactics and can be viewed as a kind of prediction. That is, if the sequence /st/ is more common than /sd/, because /st/ occurs in syllabic onsets, then it can be said that /s/ predicts /t/ (more than /s/ predicts /d/). This use of phonotactics for prediction is made explicit in machine learning, where predictive models (e.g. bigram and trigram models historically, or, more recently, recurrent neural networks) have played an important role in the development and commercial use of speech‐recognition technologies (Jurafsky & Martin, 2014; Graves & Jaitly, 2014).
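
      Because this paragraph leans on n‐gram intuitions, a small sketch may help. The following Python snippet estimates bigram probabilities from a tiny, invented phoneme lexicon; the word list and the resulting probabilities are illustrative assumptions, not real corpus statistics.

```python
from collections import Counter

# Tiny invented lexicon, transcribed as phoneme sequences;
# '#' marks word boundaries. Real systems use large corpora.
words = [
    ["#", "s", "t", "o", "p", "#"],
    ["#", "s", "t", "e", "m", "#"],
    ["#", "s", "o", "#"],
    ["#", "d", "o", "#"],
]

bigrams, unigrams = Counter(), Counter()
for w in words:
    for a, b in zip(w, w[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def p_next(prev, nxt):
    """Maximum-likelihood bigram probability P(nxt | prev)."""
    return bigrams[(prev, nxt)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_next("s", "t"))  # 0.667: /s/ strongly predicts /t/
print(p_next("s", "d"))  # 0.0:   /s/ does not predict /d/
```

      On this toy lexicon, P(/t/ | /s/) comes out to 2/3 while /sd/ never occurs, which is the precise sense in which /s/ predicts /t/ more than it predicts /d/.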

      In addition to filling in missing phonemes, the idea of temporal prediction can be invoked as an explanation of how the auditory system accomplishes one of its most difficult feats: selective attention. Selective attention is often called the cocktail party problem, because many people have experienced using selective attention at a busy, noisy party to isolate one speaker’s voice from the cacophonous mixture of many. Mesgarani and Chang (2012) simulated this cocktail party experience (unfortunately without the cocktails) by simultaneously playing two speech recordings to their subjects, one in each ear. The subjects were asked to attend to the recording presented to a specific ear, and ECoG was used to record neural responses from the STG. Using the same stimulus‐reconstruction technique as Leonard et al. (2016), Mesgarani and Chang (2012) reconstructed, in turn, the speech that was played to each ear. Despite the fact that acoustic energy entered both ears and presumably propagated up the subcortical pathway, they found that, once the neural processing of the speech streams had reached the STG, only the attended speech stream could be reconstructed; to the STG, it was as if the unattended stream did not exist.
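
      The stimulus‐reconstruction technique is, at its core, a regularized linear mapping from neural responses back to the stimulus. The sketch below uses synthetic data and plain ridge regression; the dimensions, the noise level, and the omission of time‐lagged response features are simplifying assumptions, not the actual analysis pipeline of Mesgarani and Chang (2012).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: S is a stimulus spectrogram (time x frequency bins)
# and R is the neural response (time x electrodes). We generate R as a
# noisy linear function of S so that a linear decoder can recover it.
T, n_freq, n_elec = 500, 16, 32
S = rng.normal(size=(T, n_freq))                     # "attended" speech features
mixing = rng.normal(size=(n_freq, n_elec))           # hypothetical encoding weights
R = S @ mixing + 0.1 * rng.normal(size=(T, n_elec))  # simulated ECoG responses

# Ridge-regression decoder: map responses back to the stimulus.
lam = 1.0
G = np.linalg.solve(R.T @ R + lam * np.eye(n_elec), R.T @ S)
S_hat = R @ G

# How well does the reconstruction match the original spectrogram?
corr = np.corrcoef(S.ravel(), S_hat.ravel())[0, 1]
print(f"reconstruction correlation: {corr:.3f}")
```

      Roughly speaking, such a decoder is fit on responses to single‐talker speech and then applied to responses recorded during the two‐talker mixture; in the STG data, the reconstruction succeeded only for the attended stream.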

[Figure: Schematic illustration of how the human brain reinstates missing auditory representations.]