shown in Figure 1.3.
It is significant that three or four tones reproducing a natural formant pattern evoke an experience in a naive listener of several concurrent whistles changing in pitch and loudness, and do not automatically elicit an impression of speech. The listener’s attention is free to follow the course of the auditory form of each component tone. Certainly, this aspect of a sinewave pattern is salient auditorily, and little of the raw quality prompts attention to the tones as a single compound contour. Studies show that listeners are well able to attend to individual tone components and to focus on the pattern of pitch changes each evokes over the run of a few seconds (Remez & Rubin, 1984, 1993). In other words, the immediate experience of the listener is accurately predicted by a generic auditory account: acoustic elements that change frequency at different rates and to different extents, with onsets and offsets at different moments in different frequency ranges, are dissimilar along many dimensions that specify separate perceptual streams according to gestalt principles.
Once instructed that the tones compose synthetic speech, a listener readily reports linguistic properties as if hearing the original natural utterance on which the sinewave replica was modeled. If attention to a complex, broadband contour is characteristic of the perceptual organization of speech, its sufficient condition is met in the absence of natural acoustic vocal products. Performance levels reported with this kind of copy synthesis have varied with the proficiency of the synthesis, although it has often been possible to achieve very good intelligibility, rivalling natural speech (for instance, Remez et al., 2008). Within this range of performance levels, these acoustic conditions pose a crucial test of a gestalt‐derived account of perceptual organization, for a perceiver must integrate the tones in order to compose a single sensory contour segregated from the background, ready to analyze for the linguistic properties borne on the pattern of the signal. Several tests support this claim of true integration preliminary to analysis.
In direct assessments, the intelligibility of sinewave replicas of speech exceeded intelligibility predicted from the presentation of individual tones (Remez et al., 1981, 1987, 1994). This superadditive performance is evidence of integration, and it persisted even when the tones came from separate spatial sources, violating similarity in location (Remez et al., 1994; see also Broadbent & Ladefoged, 1957). In combining the individual tones into a single time‐varying coherent stream, however, this complex organization, which is necessary for phonetic analysis, does not exclude an auditory organization as independently resolvable streams of tones (Remez & Rubin, 1984, 1993; Roberts, Summers, & Bailey, 2015). In fact, the perceiver’s resolution of the pitch contour associated with the frequency pattern of tonal constituents is acute whether or not the fusion of the tones supporting phonetic perception occurs (Remez et al., 2001). On this evidence rests the claim that sinewave replicas are bistable, exhibiting two simultaneous and exclusive organizations.
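The logic of this superadditivity test can be made concrete with a small sketch. The prediction rule used here, probability summation over independent tones, is an assumption chosen for illustration and is not taken from the cited studies. The function names and the proportions are hypothetical.

```python
def independent_prediction(p_single):
    """Proportion correct expected if each tone contributed independently.

    Assumed rule (probability summation): an item is missed only when it is
    missed on every tone presented alone.
    """
    miss = 1.0
    for p in p_single:
        miss *= (1.0 - p)
    return 1.0 - miss


def is_superadditive(p_combined, p_single):
    """True when the combined pattern beats the independent prediction."""
    return p_combined > independent_prediction(p_single)


# Hypothetical proportions correct, invented for illustration only.
single_tone_scores = [0.05, 0.10, 0.02]   # each tone presented alone
combined_score = 0.70                     # all tones presented together

predicted = independent_prediction(single_tone_scores)
superadditive = is_superadditive(combined_score, single_tone_scores)
```

On this reading, integration is indicated when the observed score for the combined tones exceeds the prediction from the tones taken singly.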
Figure 1.3 A comparison of the short‐term spectrum of natural speech (top); terminal analog synthetic speech (middle); and sinewave replica (bottom). Note the broadband resonances and harmonic spectra in natural and synthetic speech, in contrast to the sparse, nonharmonic spectrum of the three tones.
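The construction of a sinewave replica, a few tones that follow the frequency and amplitude of the formants, can be sketched in code. The following is a minimal illustration under stated assumptions, not the synthesis procedure of the cited studies: the function name, the frame duration, the sample rate, linear interpolation between frames, and the track values are all hypothetical.

```python
import math


def synthesize_sinewave_replica(tracks, frame_s=0.01, sample_rate=8000):
    """Sum time-varying sinusoids, one per formant track.

    tracks: list of tracks; each track is a list of (frequency_hz, amplitude)
    pairs, one pair per analysis frame. In practice these would be formant
    estimates from a natural utterance; here they are hypothetical inputs.
    """
    n_frames = min(len(t) for t in tracks)
    samples_per_frame = int(round(frame_s * sample_rate))
    n_samples = (n_frames - 1) * samples_per_frame
    out = [0.0] * n_samples
    for track in tracks:
        phase = 0.0  # running phase keeps the tone continuous as it glides
        for n in range(n_samples):
            i, k = divmod(n, samples_per_frame)
            frac = k / samples_per_frame
            f0, a0 = track[i]
            f1, a1 = track[i + 1]
            # Linear interpolation between adjacent frames (an assumption).
            freq = f0 + frac * (f1 - f0)
            amp = a0 + frac * (a1 - a0)
            out[n] += amp * math.sin(phase)
            phase += 2.0 * math.pi * freq / sample_rate
    return out


# Three hypothetical tone tracks of 11 frames each (100 ms in total).
tone1 = [(500 + 20 * i, 1.0) for i in range(11)]    # rising glide
tone2 = [(1500 - 30 * i, 0.5) for i in range(11)]   # falling glide
tone3 = [(2500, 0.25)] * 11                         # steady tone
signal = synthesize_sinewave_replica([tone1, tone2, tone3])
```

Because each tone is a single sinusoid with its own frequency course, the result has no harmonic structure and no broadband resonances, which corresponds to the sparse spectrum noted in the figure caption.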
Even if the sensory causes of these perceptual impressions were strictly parallel, the bistable occurrence of auditory and phonetic perceptual organization is not amenable to further simplification. A sinewave replica of speech allows two organizations, much as celebrated cases of visual bistability do: the duck–rabbit figure, Woodworth’s equivocal staircase, Rubin’s vase, and Necker’s cube. Unlike the visual cases of alternating stability, the bistability that occurs in the perception of sinewave speech is simultaneous. A conservative description of these findings is that an organization of the auditory properties of sinewave signals occurs according to gestalt‐derived principles that promote segregation of the tones into separate contours. Phonetic perceptual analysis fails to apply or to succeed under that organization. However, the concurrent variation of the tones also satisfies a non‐gestalt principle of coordinate auditory variation despite local dissimilarities, and this promotes integration of the components into a single broadband stream. This organization, binding diverse components into a single complex sensory contour, is susceptible to phonetic analysis.
The perceptual organization of speech
Characteristics of the perceptual coherence of speech
While much remains to be discovered about perceptual organization that depends on sensitivity to complex coordinate variation, research on the psychoacoustics and perception of speech from a variety of laboratories permits a rough sketch of the parameters. The portrait of perceptual organization offered here gathers evidence from different research programs that aimed to address a range of perceptual questions, for there is no unified attempt at present to understand the organization of perceptual streams that approach the acoustic variety and distributed frequency breadth of speech. Overall, these results expose the perceptual organization of speech as fast, unlearned, nonsymbolic, keyed to complex patterns of sensory variation, indifferent to sensory quality, and requiring attention whether elicited or exerted.
The evidence that perceptual organization of speech is fast rests on long‐established findings that an auditory trace fades rapidly. Although estimates vary with the task used to calibrate the durability of unelaborated auditory sensation, all of the measures reflect the urgency with which the fading trace is recoded into a more stable phonetic form (Howell & Darwin, 1977; Pisoni & Tash, 1974). It is unlikely that much of the auditory form of speech persists beyond 100 ms, and it has decayed beyond recurrent access by 400 ms. The sensory integration required for perceptual organization is tied to this pace. Contrary to this notion of perceptual organization as exceedingly rapid, an extended version of auditory scene analysis (Bregman, 1990) proposes a resort to a cognitive mechanism occurring well after primitive grouping takes place, to function as a supplement to the gestalt‐based mechanism. Such knowledge‐based mechanisms also feature as a method to resolve difficult grouping in artifactual approaches to perceptual organization (e.g. Cooke & Ellis, 2001). However, the formal or practical advantages that this method achieves come at a clear cost, namely, the rejection of boundary conditions that conform to the natural auditory limits of perceptual organization.
The propensity to organize an auditory pattern by virtue of complex coordinate variation is apparently unlearned, or nearly so. In tests with infant listeners, 14‐week‐old subjects exhibited the pattern of adult sensitivity to dichotically arrayed components of synthetic syllables (Eimas & Miller, 1992; cf. Whalen & Liberman, 1987; Vouloumanos & Werker, 2007; Rosen & Iverson, 2007). In this case, the pattern of perceptual effects evident in infants was contingent on the integration of sensory elements despite detailed failures of auditory similarity on which gestalt grouping depends. Perhaps it is an exaggeration to claim that this organizational function is strictly unlearned, for even the youngest subject in the sample had been encountering airborne sound for three months, and undeniably had the opportunity to refine their sensitivity through this exposure. However, the development of sensitivity to complex auditory patterns cannot plausibly result from a history of meticulous trial and error in listeners of such a tender age, nor is it likely to reflect specific knowledge of the auditory effects that typify American English phonetic expression. It is far likelier that this sensitivity represents the emergence of an organizational component of listening that must be present for speech perception to develop (Houston & Bergeson, 2014), and 14‐week‐old infants still have several months ahead of them before the phonetic properties of speech become conspicuous (Jusczyk, 1997).
Research on sinewave replicas of speech has shown that the perceptual organization of speech is nonsymbolic and keyed to patterns of sensory variation. The evidence is provided by tests (Remez et al., 1994; Remez, 2001; Roberts, Summers, &