have been proposals, like the motor theory of speech perception (Liberman et al., 1967; Liberman & Mattingly, 1985) or the analysis‐by‐synthesis theory (Stevens, 1960), that view speech perception as an active rather than a passive process. Analysis by synthesis says that speech perception involves trying to match what you hear to what your own mouth, and other articulators, would have needed to do to produce what you heard. Speech comprehension would therefore involve an active process of covert speech production. Following this line of thought, we might suppose that what the vSMC does, when it is engaged in deciphering what your friend is asking you at a noisy cocktail party, is in some sense the same as what the vSMC does when it is used to articulate your reply. Because we know that place‐of‐articulation features take priority over manner‐of‐articulation features in the vSMC during a speech‐production task (i.e. reading consonant–vowel syllables aloud), we might hypothesize that place‐of‐articulation features will similarly take primacy during passive listening. Interestingly, this theoretically motivated prediction turns out to be wrong.
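To make the analysis‐by‐synthesis idea concrete, here is a minimal toy sketch in Python (our own illustration, not a model from the cited papers) of a hypothesize‐and‐test recognizer: for each candidate articulation it ‘synthesizes’ an expected acoustic pattern (here simply a stored template) and selects the candidate whose synthesis best matches the noisy input. The syllable set, the templates, and the distance measure are all assumptions made for the sake of the example.

```python
# Toy analysis-by-synthesis sketch (illustrative only; not from the cited models).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical acoustic templates standing in for the output of an
# articulatory synthesizer for a few consonant-vowel syllables.
templates = {
    'ba': np.array([1.0, 0.2, 0.1]),
    'da': np.array([0.2, 1.0, 0.1]),
    'ga': np.array([0.1, 0.2, 1.0]),
}

def perceive(heard):
    """Return the candidate articulation whose synthesized acoustics best match the input."""
    return min(templates, key=lambda syllable: np.linalg.norm(templates[syllable] - heard))

# A noisy rendition of 'da', as might be heard at a cocktail party.
heard = templates['da'] + 0.1 * rng.normal(size=3)
print(perceive(heard))  # -> 'da'
```

The point of the sketch is only the loop structure: perception is cast as a search over candidate productions, which is why the theory predicts that production machinery (such as the vSMC) should be recruited during listening.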
When Cheung et al. (2016) examined neural response patterns in the vSMC while subjects listened to recordings of speech, they found that, as in the STG, it was the manner‐of‐articulation features that took precedence. In other words, representations in vSMC were conditioned by task: during speech production the vSMC favored place‐of‐articulation features (Bouchard et al., 2013; Cheung et al., 2016), but during speech comprehension the vSMC favored manner‐of‐articulation features (Cheung et al., 2016). As we discussed earlier, the STG is also organized according to manner‐of‐articulation features when subjects listen to speech (Mesgarani et al., 2014). Therefore the representations in these two areas, STG and vSMC, appear to use a similar type of code when they represent heard speech.
To be more concrete, Cheung et al. (2016) recorded ECoG from the STG and vSMC of subjects performing two tasks. One task involved reading aloud from a list of consonant–vowel syllables (e.g. ‘ba,’ ‘da,’ ‘ga’), while the other task involved listening to recordings of people producing these syllables. Instead of using hierarchical clustering, as Mesgarani et al. (2014) did in their study of the STG, Cheung et al. (2016) used a dimensionality‐reduction technique called multidimensional scaling (MDS), but with a similar goal: describing the structure of phoneme representations in the brain during each task (Figure 3.8). For the speaking task, the dimensionality‐reduced vSMC representations of eight sounds could be linearly separated into three place‐of‐articulation groups: labial /p b/, alveolar /t d s ʃ/, and velar /k g/ (see Figure 3.8, panel D). The same phonemes could not be linearly separated by place of articulation in the listening task (Figure 3.8, panel E); however, they could be linearly separated by another set of features (Figure 3.8, panel G): voiced plosives /d g b/, voiceless plosives /k t p/, and fricatives /ʃ s/. These are the same manner‐of‐articulation and voicing features that characterize the neural responses in the STG to heard speech (Figure 3.8, panel F). Again, the implication is that the vSMC has two codes for representing speech, suggesting that either there are two distinct but anatomically intermingled neural populations in the vSMC, or the same population of neurons is capable of operating in two very different representational modes. Unfortunately, the spatial resolution of ECoG electrodes is still too coarse to resolve this ambiguity, so other experimental techniques will be needed. For now, we can only say that during speech production the vSMC uses a feature analysis that emphasizes place‐of‐articulation features, whereas during speech comprehension it uses a feature analysis that instead emphasizes manner and voicing features. An intriguing possibility is that the similarity of representations for heard speech in the STG and the vSMC plays an important role in the communication, or connectivity, between these distinct cortical regions – a topic we touch on in the next section.
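For readers who want a feel for this kind of analysis, the following is a minimal sketch, not the authors’ actual pipeline, of how one might embed per‐phoneme neural response patterns with MDS and then ask whether a feature grouping is linearly separable, here using a simple linear classifier on simulated data. The electrode count, the simulated responses, and the use of scikit‐learn’s MDS and LinearSVC are all our own assumptions for illustration.

```python
# Sketch of an MDS + linear-separability analysis on simulated "neural" data.
import numpy as np
from sklearn.manifold import MDS
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

phonemes = ['p', 'b', 't', 'd', 's', 'sh', 'k', 'g']
place = ['labial', 'labial', 'alveolar', 'alveolar',
         'alveolar', 'alveolar', 'velar', 'velar']
manner = ['plosive-voiceless', 'plosive-voiced', 'plosive-voiceless',
          'plosive-voiced', 'fricative', 'fricative',
          'plosive-voiceless', 'plosive-voiced']

# Simulated mean response per phoneme across a set of electrodes
# (a real analysis would use trial-averaged high-gamma activity).
n_electrodes = 60
X = rng.normal(size=(len(phonemes), n_electrodes))

# MDS embeds the eight phonemes in 2-D so that pairwise distances between
# their neural response patterns are preserved as well as possible.
embedding = MDS(n_components=2, dissimilarity='euclidean',
                random_state=0).fit_transform(X)

def linearly_separable(points, labels):
    """Crude check: can a linear classifier fit the grouping perfectly?"""
    clf = LinearSVC(C=1e6, max_iter=100000).fit(points, labels)
    return clf.score(points, labels) == 1.0

print('separable by place: ', linearly_separable(embedding, place))
print('separable by manner:', linearly_separable(embedding, manner))
```

With real vSMC data, the analogous comparison is the one shown in Figure 3.8: place‐of‐articulation groupings separate cleanly during speaking, while manner and voicing groupings separate cleanly during listening.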
Figure 3.8 Feature‐based representations in the human sensorimotor cortex. (a) and (b) show the most significant electrodes (gray dots) for listening and speaking tasks. (c) presents a feature analysis of the consonant phonemes used in the experiments. The left phoneme in each pair is unvoiced and the right phoneme is voiced (e.g. /p/ is unvoiced and /b/ is voiced). (d–g) are discussed in the main text; each panel shows a low‐dimensional projection of the neural data where distance between phoneme representations is meaningful (i.e. phonemes that are close to each other are represented similarly in the neural data). The dotted lines show how groups of phonemes can be linearly separated (or not) according to place of articulation, manner of articulation, and voicing features.
Source: Cheung et al., 2016. Licensed under CC BY 4.0.
Systems‐level representations and temporal prediction
Our journey through the auditory system has focused on specific regions and on the auditory representation of speech in these regions. However, representations in the brain are not limited to isolated islands of cells, but also rely upon constellations of regions that relay information within a network. In this section, we touch briefly on the topic of systems‐level representations of speech perception and on the related topic of temporal prediction, which is at the heart of why we have brains in the first place.
Auditory feedback networks
One way to appreciate the dynamic interconnectedness of the auditory brain is to consider the phenomenon of auditory suppression. Auditory suppression manifests, for example, in the comparison of STG responses when we listen to another person speak and when we speak ourselves, and thus hear the sounds we produce. Electrophysiological studies have shown that auditory neurons are suppressed in monkeys during self‐vocalization (Müller‐Preuss & Ploog, 1981; Eliades & Wang, 2008). This finding is consistent with fMRI and ECoG results in humans showing that activity in the STG is suppressed during speech production compared to speech comprehension (Flinker et al., 2010). The reason for this auditory suppression is thought to be an internal signal (an efference copy) received from another part of the brain, such as the motor or premotor cortex, which has inside information about external stimuli when those stimuli are self‐produced (Holst & Mittelstaedt, 1950). The brain’s use of this kind of inside information is not, incidentally, limited to the auditory system. Anyone who has failed to tickle themselves has experienced another kind of sensory suppression, again thought to be based on internally generated expectations (Blakemore, Wolpert, & Frith, 2000).
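As a toy illustration of the efference‐copy idea (our own sketch, not a model from the cited studies), the snippet below subtracts an internally predicted signal from the incoming sound before computing an ‘auditory response’; the residual, and hence the response, is smaller for self‐produced sounds, mimicking auditory suppression. The signals and the response measure are arbitrary assumptions chosen for clarity.

```python
# Toy efference-copy model of auditory suppression (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 1000)

# The sound arriving at the ear: a simple tone plus sensor noise.
heard = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.normal(size=t.size)

def auditory_response(signal, efference_copy=None):
    """Mean rectified activity after subtracting any internally predicted input."""
    prediction = efference_copy if efference_copy is not None else np.zeros_like(signal)
    return np.abs(signal - prediction).mean()

# Listening to someone else: no internal prediction is available.
external = auditory_response(heard)

# Speaking ourselves: the motor system supplies a prediction of the sound.
self_produced = auditory_response(heard, efference_copy=np.sin(2 * np.pi * 5 * t))

print(f"response to external speech:      {external:.3f}")
print(f"response to self-produced speech: {self_produced:.3f}")  # suppressed
```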
Auditory suppression in the STG is also a function of language proficiency. As an example, Parker Jones et al. (2013) explored the interactions between the premotor cortex (PMC) and two temporal auditory areas (aSTG and pSTG) when native and nonnative English speakers performed speech‐production tasks, such as reading and picture naming, in an MRI scanner. The fMRI data were then subjected to a kind of connectivity analysis, which can tell us which regions influenced which other regions of the brain. Technically, the observed signals were deconvolved to model the effect of the hemodynamic response, and the underlying neural dynamics were inferred by inverting a generative model based on a set of differential equations (Friston, Harrison, & Penny, 2003; Daunizeau, David, & Stephan, 2011). A positive connection between two regions, A and B, means that, when the response in A is strong, the response in B will increase (i.e. B will have a positive derivative). Likewise, a negative connection means that, when the response in A is strong, the response in B will decrease (B will have a negative derivative). Between the PMC and the temporal auditory areas, Parker Jones et al. (2013) observed significant negative connections, implying that brain activity in the PMC caused a decrease in temporal auditory activity, consistent with auditory suppression. However,