to speech using electrocorticography (ECoG). Here, intracranial electrophysiological recordings are made in patients with intractable seizures, with the goal of identifying the site of seizure activity. A grid of electrodes is placed on the surface of the brain and neural activity is recorded directly, with good spatial and temporal resolution. In a recent study (Mesgarani et al., 2014), six participants listened to 500 natural speech sentences produced by 400 speakers. The sentences were segmented into sequences of phonemes. Results showed, not surprisingly, responses to speech in the posterior and mid‐superior temporal gyrus, consistent with fMRI studies showing that the perception of speech recruits temporal neural structures adjacent to the primary auditory areas (for reviews see Price, 2012; Scott & Johnsrude, 2003). Critically important were the patterns of activity that emerged. In particular, Mesgarani et al. (2014) showed selective responses of individual electrodes to features defining natural classes in English. That is, selective responses occurred for stop consonants including [p t k b d g], fricative consonants [s z f š θ], and nasals [m n ŋ]. That these patterns emerged across speakers, vowels, and phonetic contexts indicates that the inherent variability in the speech stream was essentially averaged out, leaving generalized patterns common to those features representing manner of articulation (see also Arsenault & Buchsbaum, 2015). It is unclear whether the patterns extracted are the same as those identified in the Stevens and Blumstein studies described above. However, what is clear is that the basic representational units corresponding to these features are acoustic in nature.
That responses in the temporal lobe are acoustic in nature is not surprising. A more interesting question is: What are the patterns of response to speech perception in frontal areas? As discussed earlier, some fMRI and TMS studies showed frontal activation during the perception of speech. However, what is not clear is the nature of the neural patterns underlying those responses; that is, did they reflect sensitivity to the acoustic parameters of the signal or to the articulatory gestures giving rise to the acoustic patterns?
In another notable study from the Chang lab, Cheung and colleagues (2016) used ECoG to examine neural responses to speech perception in superior temporal gyrus sites, as in Mesgarani et al. (2014). Critically, they also examined neural responses to both speech perception and speech production in frontal areas, in particular in the motor cortex – the ventral half of lateral sensorimotor cortex (vSMC). Nine participants listened to and produced the consonant–vowel (CV) syllables [pa ta ka ba da ga sa ša] in separate tasks, and in a third task passively listened to portions of a natural speech corpus (TIMIT) consisting of 499 sentences spoken by a total of 400 male and female speakers. As expected, responses in the vSMC during production reflected the somatotopic representation of the motor cortex, with distinct clustering as a function of place of articulation. That is, separate clusters emerged reflecting the different motor gestures used to produce labial, alveolar, and velar consonants.
Results of the passive listening task replicated Mesgarani et al.’s (2014) findings, showing selective responses in the superior temporal gyrus (STG) as a function of manner of articulation; that is, the stop consonants clustered together and the fricative consonants clustered together. Of importance, a similar pattern emerged in the vSMC: neural activity clustered in terms of manner of articulation, although interestingly the consonants within each cluster did not group as closely as the clusters that emerged in the STG. Thus, frontal areas are indeed activated in speech perception; however, this activation appears to correspond to the acoustic representation of speech extracted from the auditory input rather than to a transformation of the auditory input into articulatory, motor, or gestural representations. While only preliminary, these neural findings are provocative, suggesting that the perceptual representations of features, even in motor areas, are acoustic or auditory in nature, not articulatory or motor. Additional research is required to examine neural responses in frontal areas to auditory speech input spanning the full consonant inventory across vowel contexts, phonetic positions, and speakers. The question is: When consonant, vowel, or speaker variability is increased in the auditory input, will neural responses in frontal areas pattern with spectral and temporal features or with gestural features?
Conclusion
This chapter has examined the role of features in speech perception and auditory word recognition. As described, while features have generally been considered representational units in speech perception, there has been a lack of consensus about the nature of the feature representations themselves. In our view, one of the major conflicts in current theories of speech has its roots in whether researchers have focused on identifying the attributes that define the phonetic categories of speech or, alternatively, have focused on characterizing the ways in which contextual factors can influence the boundaries between phonetic categories (see Samuel, 1982). In the former, the emphasis has been on describing the acoustic‐articulatory structure of phonetic categories; in the latter, the emphasis has been on characterizing the ways in which acoustic changes ultimately affect the perception of boundaries between phonetic categories. These different emphases have also led to different conclusions. Studies focusing on the boundaries between phonetic segments have documented the ease with which boundary shifts are obtained following any number of acoustic manipulations, and as such the conclusion has been that there is no stable pattern of acoustic information corresponding to these categories. Analyses of the acoustic characteristics of speech have produced mixed results. Focusing on individual cues and considering them as distinct events failed to show stable acoustic patterns associated with these cues. In contrast, focusing on the integration of spectral‐temporal properties revealed more generalized patterns or properties that contribute to the identification of a phonetic segment or phonetic feature.
So what is the story? Does acoustic invariance obviate variability? Does variability trump invariance? In both cases, we believe not. Both stable acoustic patterns and the variability inherent in the speech stream play a critical role in speech perception and word recognition. Invariant acoustic patterns corresponding to features allow for stability in perception. As such, features serve as essential building blocks for the speaker‐hearer in processing the sounds of language. They provide a framework for processing speech and ultimately words by allowing acoustically variable manifestations of a sound in different phonetic contexts to be realized as one and the same phonetic dimension. In short, they serve as a means of bootstrapping the perceptual system for the critical job of mapping the auditory input not only onto phonetic categories but also onto words.
But variability plays a crucial role as well. It allows for graded activation within the language‐processing stream and hence provides the perceptual system with a richness and flexibility in accessing phonetic features, words, and even meanings that would be impossible were variability treated as “noise” and not represented by the listener. Sensitivity to variability allows listeners to recognize differences that are crucial in language communication. For example, retaining fine‐structure information allows us to recognize the speaker of a message. And variability allows for the establishment and internalization of probability distributions. Presumably, acoustic inputs that are infrequently produced would require more processing and neural resources than acoustic inputs that are in the center of a category or that more closely match a word representation. As such, both processing and neural resources would be freed up when more frequent features and lexical items occur, and additional resources would be needed for less frequent occurrences. In this way, the system is not only flexible but also plastic, affording a means for the basic stable structure of speech to be shaped and influenced by experience.
REFERENCES
Andruski, J. E., Blumstein, S. E., & Burton, M. (1994). The effect of subphonetic differences on lexical access. Cognition, 52(3), 163–187.
Apfelbaum, K. S., Blumstein, S. E., & McMurray, B. (2011). Semantic priming is affected by real‐time phonological competition: Evidence for continuous cascading systems. Psychonomic Bulletin & Review, 18(1), 141–149.
Arsenault, J. S., & Buchsbaum, B. R. (2015). Distributed neural representations of phonological features during speech perception. Journal of Neuroscience, 35(2), 634–642.
Bailey,