of sentences in which a sinewave replicating the second formant was presented to one ear while tone analogs of the first, third, and fricative formants were presented to the other ear. In such conditions, much as Broadbent and Ladefoged had found, perceptual fusion readily occurs despite the violation of spatial similarity and the absence of other attributes to promote gestalt‐based grouping. To sharpen the test, an intrusive tone was presented in the same ear with the tone analogs of the first, third, and fricative formants. This single tone presented by itself does not evoke phonetic impressions, and is perceived as an auditory form without symbolic properties: it merely changes in pitch and loudness without phonetic properties. In order to resolve the speech stream under such conditions, a listener must reject the intrusive tone despite its spatial similarity to the first, third, and fricative tones of the sentence, and appropriate the tone analog of the second formant to form the speech stream despite its spatial displacement from the tones with which it combines. Control tests established that a tone analog of the second formant alone failed to evoke an impression of phonetic properties. Performance of listeners in a transcription task, a rough estimate of phonetic coherence, was good if the intrusive tone did not vary in a speechlike manner. That is, an intrusive tone of constant frequency or of alternating frequency had no effect on the perceptual organization of speech. When the intrusive tone exhibited the tempo and range of frequency variation appropriate for a second formant, without supplying the proper variation that would combine with other tones to form an intelligible stream, performance suffered. It was as if the criterion for integration of a tone was specific to its frequency variation under conditions in which it was nonetheless unintelligible.
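The arrangement of tones across the two ears in this test can be made concrete with a minimal sketch. Everything in it is an assumption introduced for illustration: the function names, the 8 kHz sampling rate, the 10 ms frame, and the frequency values are hypothetical and are not the stimulus parameters of the study. Amplitude variation, including the intermittency of the fricative tone, is omitted for brevity.

```python
import math

def tone(freqs_hz, sample_rate=8000, frame_s=0.01, amp=1.0):
    """Synthesize one sinusoid whose frequency is updated once per frame."""
    n = round(sample_rate * frame_s)  # samples per frame
    out, phase = [], 0.0
    for f in freqs_hz:
        for _ in range(n):
            out.append(amp * math.sin(phase))
            phase += 2.0 * math.pi * f / sample_rate  # phase stays continuous
    return out

def mix(*signals):
    """Sum equal-length signals sample by sample."""
    return [sum(values) for values in zip(*signals)]

def dichotic_trial(f1, f2, f3, fric, intruder):
    """Return (ear_a, ear_b).

    ear_a carries only the tone analog of the second formant.
    ear_b carries the first, third, and fricative tones plus the intrusive tone.
    """
    ear_a = tone(f2)
    ear_b = mix(tone(f1), tone(f3), tone(fric), tone(intruder))
    return ear_a, ear_b

def constant_intruder(n_frames, hz=1500.0):
    """Intrusive tone of constant frequency (hypothetical value)."""
    return [hz] * n_frames

def alternating_intruder(n_frames, low=1200.0, high=1800.0, period=10):
    """Intrusive tone that alternates between two frequencies (hypothetical values)."""
    return [low if (i // period) % 2 == 0 else high for i in range(n_frames)]
```

A speechlike intruder would instead take a frame-by-frame frequency track with the tempo and range of a second formant, for instance one borrowed from a different sentence. That choice is likewise an assumption of this sketch.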
Since the advent of the telephone, it has been obvious that a listener’s ability to find and follow a speech stream is indifferent to distortion of natural auditory quality. The lack of spectral fidelity in early forms of speech technology made speech sound phony, literally, yet it was readily recognized that this lapse of natural quality did not compromise the usefulness of speech as a communication channel (Fletcher, 1929). This fact indicates clearly that the functions of perceptual organization hardly aim to collect aspects of sensory stimulation that have the precise auditory quality of natural speech. Indeed, Liberman and Cooper (1972) argued that early synthesis techniques evoked phonetic perception because the perceiver cheerfully forgave departures from natural quality that were often extreme. In techniques such as speech chimeras (Smith, Delgutte, & Oxenham, 2002) and sinewave replication, the acoustic properties of intelligible signals lie beyond the productive capability of a human vocal tract, and the impossibility of such spectra as vocal sound does not evidently block the perceptual organization of the sound as speech. The variation of a spectral envelope can be taken by listeners to be speechlike despite acoustic details that give rise to impressions of gross unnaturalness. Findings of this sort contribute a powerful argument against psychoacoustic explanations of speech perception generally (e.g. Holt, 2005; Lotto & Kluender, 1998; Lotto, Kluender, & Holt, 1997; Toscano & McMurray, 2010), and of perceptual organization specifically.
Ordinary subjective experience of speech suggests that perceptual organization is unbidden, for speech seems to pop right out of a nearby commotion. Yet studies reveal that sensory contours, whether simple or complex, form only with attention. In speech, as with simpler contours, the primitive segregation of figure and ground is at stake. Attention permits perceptual analysis to apply to a broadband contour of heterogeneous acoustic composition. Consistent with this axiom – that sensory contours require attention to form – findings with sinewave replicas of utterances show that the perceptual organization of speech requires attention and is not an automatic consequence of a class of sensory effects. This feature differs from the automatically engaged process proposed in strict modular terms by Liberman and Mattingly (1985). With sinewave signals, most subjects fail to notice that concurrent tones can cohere unless they are asked specifically to listen for speech (Remez et al., 1981; also see Liebenthal et al., 2003), indicating that the auditory forms alone do not evoke speech perception. Critically, a listener who is asked to attend to arbitrary tone patterns as if listening to speech fails to report phonetic impressions (Remez et al., 1981), indicating that both signal structure and phonetic attention are required for the organization and analysis of speech. A neural population code representing the speech spectrum without attention cannot be responsible for both the stable albeit unintegrated auditory form of sinewave speech and the stable integrated coherent contour that is susceptible to phonetic analysis (cf. Engineer et al., 2008). In this regard, general auditory perceptual organization is similar to speech perception in requiring attention for auditory figures to form (e.g. Carlyon et al., 2001).
Of course, a natural vocal signal exhibits the phenomenal quality of speech, and this is evidently sufficient to elicit a productive form of attention for perceptual organization to ensue. This premise cautions against the use of passive listening procedures to identify supposed automatic functions of linguistic analysis of speech (e.g. Zevin et al., 2010). Such studies merely fail to secure attention. A listener whose attention is free to wander cannot be considered inattentive to the sounds delivered without instruction. In such conditions, performance arguably reflects a mix of cognitive states evoked with attention and vegetative excitation evoked without attention.
Generic auditory organization and speech perception
The intelligibility of sinewave replicas of utterances, of noise‐band vocoded speech, and of speech chimeras reveals that a perceiver can find and follow a speech signal composed of dissimilar acoustic and auditory constituents, in contrast to the principles on which gestalt‐based generic functions operate. These findings show that perceptual organization of speech can occur solely by virtue of attention to the complex coordinate variation of an acoustic pattern. The use of such exotic acoustic signals for the proof creates some uncertainty that ordinary speech perception is satisfactorily characterized by tests using these acoustic oddities. An argument of Remez et al. (1994) for considering these tests to be a useful index of the perception of commonplace speech signals begins by noting that phonetic perception of sinewave replicas of utterances depends on a simple instruction to listen to the tones as speech. Because the disposition to hear sinewave words and sentences appears readily, without arduous or lengthy training, this prompt adaptation to phonetic organization and analysis suggests that the ordinary cognitive resources of speech perception are operating for sinewave speech. Although some form of short‐term perceptual learning might be involved, the swiftness of the appearance of adequate perceptual function is evidence that any special induction to accommodate sinewave signals is a marginal component of perception.
Despite all this, natural speech consists of large stretches of glottal pulsing, which creates amplitude comodulation over time and harmonic relations between concurrent portions of the spectrum. This has led to a reasonable proposal (Barker & Cooke, 1999; Darwin, 2008) that generic auditory grouping functions, although not necessary for the perceptual organization of speech, contribute to perceptual organization when speech spectra satisfy the gestalt criteria. The consistent finding that speech spectra organize quickly – on the order of milliseconds – and generic auditory grouping takes time to build – on the order of seconds – may justify doubt in the asserted privilege of gestalt‐based grouping by similarity. A critical empirical test was provided by Carrell and Opie (1992), which offers an index of the plausibility of the claim. In the test, the intelligibility of sinewave sentences was compared in two acoustic conditions: (1) three‐tone time‐varying sinusoids; and (2) three‐tone time‐varying sinusoids on which a regular amplitude pulse was imposed. Although the tone patterns in the first condition were not susceptible to gestalt‐based grouping, because they failed to exhibit similarity in each of the relevant dimensions that we have discussed, the pulsed tone patterns in the second condition exhibited amplitude comodulation and harmonicity in their complex spectra (Bregman, Levitan, & Liao, 1990). All other things being equal, the perceptual organization attributable to complex coordinate variation should have been reinforced by perceptual organization attributable to similarity that triggers generic auditory grouping. Indeed, Carrell and Opie found that pulsed sentences were more intelligible than smoothly varying sinusoids, as if the spectral components once bound more securely were more successfully analyzed.
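The contrast between the two acoustic conditions can be illustrated with a minimal sketch. It assumes a raised-cosine pulse at 100 Hz, an 8 kHz sampling rate, and 10 ms frames. These values, the function names, and the pulse shape are hypothetical and are not the parameters reported by Carrell and Opie (1992). The sketch shows only how a single shared envelope makes the three tones rise and fall together, which is the sense of amplitude comodulation at issue.

```python
import math

def tone_track(freqs_hz, amps, sample_rate=8000, frame_s=0.01):
    """Synthesize one time-varying sinusoid from per-frame frequency and amplitude."""
    n = round(sample_rate * frame_s)  # samples per frame
    out, phase = [], 0.0
    for f, a in zip(freqs_hz, amps):
        for _ in range(n):
            out.append(a * math.sin(phase))
            phase += 2.0 * math.pi * f / sample_rate  # phase stays continuous
    return out

def sinewave_replica(tracks, sample_rate=8000, frame_s=0.01):
    """Condition 1: sum of smoothly varying sinusoids.

    tracks is a list of (freqs_hz, amps) pairs, one pair per tone.
    """
    tones = [tone_track(f, a, sample_rate, frame_s) for f, a in tracks]
    return [sum(values) for values in zip(*tones)]

def impose_pulse(signal, pulse_hz=100.0, sample_rate=8000):
    """Condition 2: multiply the summed tones by one periodic envelope.

    The envelope is a raised cosine between 0 and 1 (an assumed shape),
    so every tone is modulated in the same way at the same moments.
    """
    return [
        s * 0.5 * (1.0 - math.cos(2.0 * math.pi * pulse_hz * i / sample_rate))
        for i, s in enumerate(signal)
    ]
```

Given three hypothetical tracks, `smooth = sinewave_replica(tracks)` yields the first condition and `impose_pulse(smooth)` yields the second. Because the envelope is applied after summation, the frequency variation of each tone is left untouched and only the shared amplitude pattern differs between conditions.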
The assertion offered by Barker and Cooke (1999) about this phenomenon is that generic auditory functions can reinforce the grouping of speech signals, although on