speakers, there was no significant auditory suppression, but there was a positive connection from pSTG to PMC, consistent with the idea of error feedback. The results suggest that PMC sends signal‐canceling, top‐down predictions to aSTG and pSTG. These top‐down predictions are stronger if you are a native speaker and more confident about what speech sounds you produce. In nonnative speakers, the top‐down predictions canceled less of the auditory input, and a bottom‐up learning signal (“error”) was fed back from the pSTG to the PMC. Interestingly, as the nonnative speakers became more proficient, the learning signals were observed to decrease, so that the most highly proficient nonnative speakers were indistinguishable from native speakers in terms of error feedback.
The example of auditory suppression argues for a systems‐level view of speech comprehension that includes both auditory and premotor regions of the brain. Theoretically, we might think of these regions as being arranged in a functional hierarchy, with PMC located above both aSTG and pSTG. Top‐down predictions may thus be said to descend from PMC to aSTG and pSTG, while bottom‐up errors percolate in the opposite direction, from pSTG to PMC. We note that the framework used to interpret the auditory suppression results, predictive coding, subtly inverts the view that perceptual systems in the brain passively extract knowledge from the environment; instead, it proposes that these systems are actively trying to predict their sense experiences (Ballard, Hinton, & Sejnowski, 1983; Mumford, 1992; Kawato, Hayakawa, & Inui, 1993; Dayan et al., 1995; Rao & Ballard, 1999; Friston & Kiebel, 2009). In a foundational sense, predictive coding frames the brain as a forecasting machine, which has evolved to minimize surprises and to anticipate, and not merely react to, events in the world (Wolpert, Ghahramani, & Flanagan, 2001). This is not necessarily to say that what it means to be a person is to be a prediction machine, but rather to conjecture that perceptual systems in our brains, at least sometimes, predict sense experiences.
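The core computational loop of predictive coding, in which top‐down predictions descend and bottom‐up errors ascend, can be sketched in a few lines. The toy model below, in the spirit of Rao and Ballard (1999), is purely illustrative: the matrix sizes, learning rate, and the informal identification of the higher level with PMC and the lower level with STG are our assumptions, not parameters taken from any of the studies discussed.

```python
import numpy as np

# Toy predictive-coding loop (illustrative only). A "higher" level (cf. PMC)
# holds an estimate r and predicts the activity of a "lower" level (cf. STG)
# through a generative weight matrix W. The lower level sends back only the
# prediction error, which the higher level uses to refine its estimate.

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))          # top-down generative weights (fixed here)
W /= np.linalg.norm(W, 2)            # scale weights so updates are stable
r = np.zeros(4)                      # higher-level estimate, initially naive
cause = rng.normal(size=4)           # hidden cause behind the sensory input
x = W @ cause                        # sensory input the model can explain

lr = 0.5
for _ in range(5000):
    prediction = W @ r               # top-down prediction sent to lower level
    error = x - prediction           # bottom-up residual ("learning signal")
    r += lr * W.T @ error            # update estimate to reduce the error

# Once the estimate is well fitted, the prediction cancels almost all of the
# input -- the analogue of auditory suppression in confident native speakers.
residual = np.linalg.norm(x - W @ r)
```

A confident model (small residual) sends back little error, while a poorly fitted one (as in the nonnative-speaker case) generates a large bottom‐up learning signal.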
Temporal prediction
The importance of prediction as a theme and as a hypothetical explanation for neural function also goes beyond explicit modeling in neural networks. We can invoke the idea of temporal prediction even when we do not know the underlying connectivity patterns. Speech, for example, does not consist of a static set of phonemes; rather, speech is a continuous sequence of events, such that hearing part of the sequence gives you information about other parts that you have yet to hear. In phonology the sequential dependency of phonemes is called phonotactics and can be viewed as a kind of prediction. That is, if the sequence /st/ is more common than /sd/, because /st/ occurs in English syllable onsets (as in ‘stop’) while /sd/ does not, then it can be said that /s/ predicts /t/ (more than /s/ predicts /d/). This use of phonotactics for prediction is made explicit in machine learning, where predictive models (e.g. bigram and trigram models historically, or, more recently, recurrent neural networks) have played an important role in the development and commercial use of speech‐recognition technologies (Jurafsky & Martin, 2014; Graves & Jaitly, 2014).
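To make the idea concrete, here is a toy bigram model in Python. The five‐word corpus is hypothetical, and spellings stand in for phonemic transcriptions; real speech recognizers estimate such probabilities from large transcribed corpora.

```python
from collections import Counter

# Hypothetical mini-corpus; letters stand in for phonemes.
corpus = ["stop", "stem", "star", "stay", "wisdom"]

bigrams = Counter()   # counts of adjacent symbol pairs
unigrams = Counter()  # counts of symbols in pair-initial position
for word in corpus:
    for a, b in zip(word, word[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def p_next(a, b):
    """Conditional probability P(b | a) under the bigram model."""
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0
```

In this corpus, /s/ is followed by /t/ four times out of five and by /d/ once, so `p_next("s", "t")` is 0.8 while `p_next("s", "d")` is 0.2: /s/ predicts /t/ more than it predicts /d/.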
In neuroscience, the theme of prediction comes up in masking and perceptual restoration experiments. One remarkable ECoG study, by Leonard et al. (2016), played subjects recordings of words in which key phonemes were masked by noise. For example, a subject might have heard /fæ#tr/, where the /#/ symbol represents a brief noise burst masking the underlying phoneme. In this example, the intended word is ambiguous: it could have been /fæstr/ ‘faster’ or /fæktr/ ‘factor’. So, by controlling the context in which the stimulus was presented, Leonard et al. (2016) were able to bias subjects toward hearing one word or the other. In the sentence ‘On the highway he drives his car much /fæ#tr/,’ we expect the listener to perceive the word ‘faster’ /fæstr/. In another sentence, that expectation was modified so that subjects perceived the same noisy segment of speech as ‘factor’ /fæktr/. Leonard et al. (2016) then used a technique called stimulus reconstruction, by which it is possible to infer rather good speech spectrograms from intracranial recordings (Mesgarani et al., 2008; Pasley et al., 2012). Spectrograms reconstructed from masked stimuli showed that the STG had filled in the missing auditory representations (Figure 3.9). For example, when the context led subjects to perceive the ambiguous stimulus as ‘faster’ /fæstr/, the reconstructed spectrogram contained an imagined fricative [s] (Figure 3.9, panel E). When subjects perceived the word as ‘factor’ /fæktr/, the reconstructed spectrogram contained an imagined stop [k] (Figure 3.9, panel F). In this way, Leonard et al. (2016) demonstrated that auditory representations of speech are sensitive to their temporal context.
In addition to filling in missing phonemes, the idea of temporal prediction can be invoked as an explanation of how the auditory system accomplishes one of its most difficult feats: selective attention. Selective attention is often called the cocktail party problem, because many people have experienced the use of selective attention in a busy, noisy party to isolate one speaker’s voice from the cacophonous mixture of many. Mesgarani and Chang (2012) simulated this cocktail party experience (unfortunately without the cocktails) by simultaneously playing two speech recordings to their subjects, one in each ear. The subjects were asked to attend to the recording presented to a specific ear and ECoG was used to record neural responses from the STG. Using the same stimulus‐reconstruction technique as Leonard et al. (2016), Mesgarani and Chang (2012) took turns reconstructing the speech that was played to each ear. Despite the fact that acoustic energy entered both ears and presumably propagated up the subcortical pathway, Mesgarani and Chang (2012) found that, once the neural processing of the speech streams had reached the STG, only the attended speech stream could be reconstructed; to the STG, it was as if the unattended stream did not exist.
We know from a second cocktail party experiment (which again did not include any actual cocktails) that selective attention is sensitive to how familiar the hearer is with each speaker. In their behavioral study, Johnsrude et al. (2013) recruited a group of subjects that included multiple spouses. If you were a subject in the study, your partner’s voice was sometimes the target (i.e. attended speech); your partner’s voice was sometimes the distractor (i.e. unattended speech); and sometimes both target and distractor voices belonged to other subjects’ spouses. Johnsrude et al. (2013) found that not only were subjects better at recalling semantic details of the attended speech when the target speaker was their partner, but they also performed better when their spouse played the role of distractor, compared to when both target and distractor roles were played by strangers. In effect, Johnsrude et al. (2013) amusingly showed that people are better at ignoring their own spouses than they are at ignoring strangers. Given that hearers can fill in missing information when it can be predicted from context (Leonard et al., 2016), it makes sense that subjects should better comprehend the speech of someone familiar, whom they are better at predicting, than that of a stranger. Given that native speakers are better than nonnative speakers at suppressing the sound of their own voices (Parker Jones et al., 2013), it also makes sense that subjects should be better able to suppress the voice of their spouse – again assuming that their spouse’s voice is more predictable to them than a stranger’s. Taken together, these findings suggest that the mechanism behind selective attention is, again, prediction. So, while Mesgarani and Chang (2012) were unable to reconstruct the speech of a distractor voice from ECoG recordings in the STG, it may be that higher brain regions nonetheless contain a representation of the distractor voice for the purpose of suppressing it.
An as yet unproven hypothesis is that the increased neural activity in frontal areas observed during noisy listening conditions (Davis & Johnsrude, 2003) may be representing background noise or distractor voices, so that these sources can be filtered out of the mixed input signal. One way to test this would be to replicate Mesgarani and Chang’s (2012) cocktail party study, but to focus on reconstructing speech from ECoG recordings taken from the auxiliary speech comprehension areas described by Davis and Johnsrude (2003), rather than from the STG.