it was possible to make hypotheses about particular portions of the signal, or cues, corresponding to particular features of sounds or segments (phonetic categories). Using the pattern playback, these potential cues were then systematically varied and presented to listeners for identification. Results reported in their seminal paper (Liberman et al., 1967) showed clearly that phonetic segments occur in context and cannot be treated as separate “beads on a string.” Indeed, context ultimately influences the acoustic manifestation of a particular phonetic segment, resulting in acoustic differences for the same features of sound. For example, sound spectrograms of stop consonants show a burst and formant transitions, which potentially serve as cues to place of articulation in stop consonants. Varying the frequency of the burst or the onset frequency of the second formant transition and presenting the resulting stimuli to listeners provided a means of systematically assessing the perceptual role these cues played. Results showed that there was no systematic relation between a particular burst frequency, or onset frequency of the second formant transition, and place of articulation in stop consonants (Liberman, Delattre, & Cooper, 1952). For example, there was no constant burst frequency or formant transition onset that signaled [d] in the syllables [di] and [du]. Rather, the acoustic manifestation of sound segments (and the features that underlie them) is shaped by the acoustic parameters of the phonetic contexts in which they occur.
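The logic of these cue-manipulation experiments can be made concrete with a short sketch. The code below is not a model of the pattern playback itself; it is a minimal sinewave analogue of a consonant–vowel syllable in which two sinusoids trace hypothetical F1 and F2 tracks, and only the F2 onset frequency is varied across a continuum. All frequency and duration values here are illustrative assumptions, not values from the Haskins studies.

```python
import numpy as np

def synthesize_token(f2_onset_hz, sr=16000, f1_onset_hz=300.0,
                     f1_steady_hz=700.0, f2_steady_hz=1200.0,
                     trans_ms=40.0, total_ms=250.0):
    """Sinewave analogue of a CV syllable: two sinusoids trace F1/F2
    tracks that glide linearly from onset to steady-state values."""
    n = int(sr * total_ms / 1000)
    n_trans = int(sr * trans_ms / 1000)
    # Formant tracks: linear transition followed by a steady state.
    f1 = np.concatenate([np.linspace(f1_onset_hz, f1_steady_hz, n_trans),
                         np.full(n - n_trans, f1_steady_hz)])
    f2 = np.concatenate([np.linspace(f2_onset_hz, f2_steady_hz, n_trans),
                         np.full(n - n_trans, f2_steady_hz)])
    # Integrate instantaneous frequency to obtain phase, then sum sinusoids.
    phase1 = 2 * np.pi * np.cumsum(f1) / sr
    phase2 = 2 * np.pi * np.cumsum(f2) / sr
    wave = np.sin(phase1) + 0.5 * np.sin(phase2)
    return wave / np.abs(wave).max()

# A seven-step continuum in which only the F2 onset frequency varies.
continuum = [synthesize_token(f2) for f2 in np.linspace(900, 2200, 7)]
```

Presenting such a continuum for identification is, in miniature, the procedure that revealed the lack of a one-to-one mapping between a single cue value and perceived place of articulation.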
Liberman et al. (1967) recognized that listener judgments were nonetheless consistent. What, then, allowed the various acoustic patterns to be perceived as the same consonant? They proposed the motor theory of speech perception, hypothesizing that what provided stability in the variable acoustic input was the production of the sounds, that is, the articulatory gestures giving rise to them (for reviews see Galantucci, Fowler, & Turvey, 2006; Liberman et al., 1967; Fowler, 1986; Fowler, Shankweiler, & Studdert‐Kennedy, 2016). In this view, despite their acoustic variability, constant articulatory gestures provided phonetic category stability: [p] and [b] are both produced with the stop closure at the lips, [t] and [d] with the stop closure at the alveolar ridge, and [k] and [g] with the closure at the velum.
It is worth noting that even the motor theory fails to specify the nature of the mapping from the variable acoustic input to a particular articulatory gesture. That is, it does not specify what it is in the acoustic signal that allows the input to be transformed into a particular motor pattern. In this sense, the motor theory of speech perception did not solve the invariance problem. That said, there are many proponents of the motor (gesture) theory of speech perception (see Fowler, Shankweiler, & Studdert‐Kennedy, 2016, for a review), and recently evidence from cognitive neuroscience has been used to provide support (see D’Ausilio, Craighero, & Fadiga, 2012, for a review). In particular, a number of studies have shown that the perception of speech not only activates auditory areas of the brain (temporal structures) but also, under some circumstances, activates motor areas involved in speech production. For example, using fMRI, activation has been shown during passive listening to syllables in the same motor areas that are activated in producing those syllables (Wilson et al., 2004), and greater activation has been shown in these areas for nonnative than for native speech sounds (Wilson & Iacoboni, 2006). Transcranial magnetic stimulation (TMS) studies showed a change in the perception of labial stimuli near the phonetic boundary of a labial–alveolar continuum after stimulation of motor areas controlling the lips; no perceptual changes occurred for continua not involving labial stimuli, for example, alveolar–velar continua (Möttönen & Watkins, 2009; Fadiga et al., 2002). Nonetheless, activation of motor areas during speech perception in both the fMRI and TMS studies appears to occur under challenging listening conditions, such as when the acoustic stimuli are of poor quality, when sounds are not easily mapped to a native‐language inventory, or during the perception of boundary stimuli, but not when the stimuli are good exemplars. These findings raise the possibility that frontal areas are recruited when additional neural resources are necessary, and thus are not core areas recruited in the perception of speech (see Schomers & Pulvermüller, 2016, for a contrasting view).
It would not be surprising to see activation of motor areas during the perception of speech, as listeners are also speakers, and speakers perceive the acoustic realization of their own productions. A neural circuit bridging temporal and motor areas would therefore be expected (see Hickok & Poeppel, 2007). However, what needs to be shown in support of the motor (gesture) theory of speech is that the representations underlying speech perception are motoric or gestural. It is, of course, possible that there are gestural as well as acoustic representations corresponding to the features of speech. At a minimum, however, to support the motor theory of speech, gestures need to be identified that provide a perceptual standard for mapping from auditory input to phonetic feature. As we will see shortly, the evidence to date does not support such a view (for a broad discussion challenging the motor theory of speech perception, see Lotto, Hickok, & Holt, 2009).
The acoustic theory of speech perception
Despite the variability in the speech input, it is possible that more generalized acoustic patterns can be derived that are common to features of sounds, patterns that override the fine acoustic detail obtained from analyzing individual components of the signal such as burst frequency or the onset frequency of formant transitions. The question is where in the signal such properties might reside and how they can be identified.
One hypothesis that became the focus of the renewed search for invariant acoustic cues was that more generalized patterns could be derived at points where there are rapid changes in the spectrum. These landmarks mark the transitions from one articulatory state to another and serve as anchor points for extracting feature information (Stevens, 2002). Once the landmarks were identified, it was necessary to identify the acoustic parameters that provided stable patterns associated with features and, ultimately, phonetic categories. To this end, research focused on the spectral patterns that emerged from integrating amplitude and frequency parameters within a window of analysis, rather than on portions of the speech signal that had been identified on the sound spectrogram and treated as distinct acoustic events.
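To convey the intuition behind landmark detection, the sketch below locates candidate landmarks as the frames where the short-time log spectrum changes most rapidly. Stevens’s actual detectors track energy changes in specific frequency bands tied to particular articulatory events; this generic spectral-flux measure, with an arbitrary mean-plus-one-standard-deviation threshold, is an assumption made only for illustration.

```python
import numpy as np

def landmark_times(signal, sr, frame_ms=10.0):
    """Return candidate landmark times (in seconds): frames where the
    short-time log spectrum changes most rapidly (peaks in spectral flux)."""
    n = int(sr * frame_ms / 1000)
    hop = n // 2
    frames = [signal[i:i + n] * np.hanning(n)
              for i in range(0, len(signal) - n, hop)]
    logspec = np.array([20 * np.log10(np.abs(np.fft.rfft(f)) + 1e-9)
                        for f in frames])
    # Frame-to-frame spectral change (RMS difference across frequency).
    flux = np.sqrt((np.diff(logspec, axis=0) ** 2).mean(axis=1))
    thresh = flux.mean() + flux.std()
    # Local maxima of flux above threshold are candidate landmarks.
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > thresh and flux[i] >= flux[i - 1]
             and flux[i] >= flux[i + 1]]
    return [(p + 1) * hop / sr for p in peaks]
```

The analysis windows between such peaks are the regions within which more generalized, integrated spectral patterns would then be sought.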
The first features examined in this way were those for place of articulation in stop consonants, the features that had failed to show invariance in the Haskins research. In a series of papers, Stevens and Blumstein explored whether the shape of the spectrum in the 25‐odd ms at consonant release could independently characterize labial, alveolar, and velar stop consonants across speakers and vowel contexts. Here, labial consonants were defined in terms of a flat or falling spectral shape, alveolar consonants in terms of a rising spectral shape, and velar consonants in terms of a compact spectral shape with one peak dominating the spectrum (Stevens & Blumstein, 1978). Acoustic analysis of the consonants [p t k b d g] produced by six speakers in the context of the vowels [i e a o u] classified the place of articulation of the stimuli with 85 percent accuracy (Blumstein & Stevens, 1979). Follow‐up perceptual experiments showed that listeners could identify place of articulation (as well as the following vowel) when presented with only the first 20 ms from the onset of the burst, indicating that they were sensitive to the spectral shape at stop consonant onset (Blumstein & Stevens, 1980; see also Chang & Blumstein, 1981).
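A toy version of this classification logic is sketched below: compute the spectrum of the first ~25 ms after release, estimate its overall tilt, and check whether a single peak dominates. The specific measures and thresholds here are illustrative assumptions, not Stevens and Blumstein’s actual spectral templates, which were defined over LPC-smoothed spectra with carefully specified criteria.

```python
import numpy as np

def classify_onset_spectrum(release, sr):
    """Toy classifier over the gross shape of the onset spectrum (first
    ~25 ms after stop release). Thresholds are illustrative only."""
    win = release[:int(0.025 * sr)]
    spec = 20 * np.log10(np.abs(np.fft.rfft(win * np.hamming(len(win)))) + 1e-9)
    freqs = np.fft.rfftfreq(len(win), 1.0 / sr)
    band = (freqs >= 100) & (freqs <= 5000)
    # Spectral tilt: slope of a straight line fit to the log spectrum.
    tilt = np.polyfit(freqs[band], spec[band], 1)[0]        # dB per Hz
    # Compactness: how far the strongest peak stands above the median level.
    prominence = spec[band].max() - np.median(spec[band])
    if prominence > 20:       # a single dominant peak -> compact
        return "velar (compact)"
    if tilt > 0:              # rising spectrum -> diffuse-rising
        return "alveolar (diffuse-rising)"
    return "labial (diffuse-flat/falling)"
```

The design point the sketch preserves is that classification depends on the integrated shape of the onset spectrum, not on any single frequency value such as the burst peak or an F2 onset.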
Invariant properties were identified for additional phonetic features, giving rise to a theory of acoustic invariance hypothesizing that, despite the variability in the acoustic input, there were more generalized patterns that provided the listener with a stable framework for perceiving the phonetic features of language (Blumstein & Stevens, 1981; Stevens & Blumstein, 1981; see also Kewley‐Port, 1983; Nossair & Zahorian, 1991). These features include those signifying manner of articulation for [stops], [glides], [nasals], and [fricatives] (Kurowski & Blumstein, 1984; Mack & Blumstein, 1983; Shinn & Blumstein, 1984; Stevens & Blumstein, 1981). Additionally, research has shown that if the auditory speech input is normalized for speaker and vowel context, generalized patterns can be identified for both stop (Johnson, Reidy, & Edwards, 2018) and fricative place of articulation (McMurray & Jongman, 2011).
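The following sketch illustrates one simple form such normalization can take: re-coding each acoustic cue relative to the producing speaker’s own means, so that residual, speaker-independent structure becomes visible. This is a minimal sketch in the spirit of McMurray and Jongman’s (2011) relative cue encoding; their model codes cues relative to expectations derived from both speaker and vowel context, whereas this version z-scores within speaker only.

```python
import numpy as np

def normalize_by_speaker(cues, speakers):
    """Re-code acoustic cues relative to each speaker's own means.
    `cues`: (n_tokens, n_cues) array; `speakers`: one label per token."""
    cues = np.asarray(cues, dtype=float)
    speakers = np.asarray(speakers)
    out = np.empty_like(cues)
    for s in np.unique(speakers):
        mask = speakers == s
        mu = cues[mask].mean(axis=0)
        sd = cues[mask].std(axis=0) + 1e-9   # avoid division by zero
        out[mask] = (cues[mask] - mu) / sd   # within-speaker z-score
    return out
```

On this view, much of the apparent lack of invariance reflects variance attributable to the speaker and context, which a listener (or classifier) can factor out before mapping cues to phonetic categories.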
A new approach to the question of invariance provides perhaps the strongest support for the notion that listeners extract global invariant acoustic properties in processing the phonetic categories of speech. Pioneering work from the lab of