of speech presented in noise in which listeners could also see the talkers whose words they aimed to recognize. The point of the study was to calibrate the level at which the speech signal would become so faint in the noise that, to sustain adequate performance, attention would switch from an inaudible acoustic signal to the visible face of the talker. In fact, the visual channel contributed to intelligibility at all levels of performance, indicating that the perception of speech is ineluctably multisensory. But how does the perceiver determine the audible and visible composition of a speech stream? This problem (reviewed by Rosenblum & Dorsi, Chapter 2) is a general form of the listener’s specific problem of perceptual organization, understood as a function that follows the speechlike coordinate variation of a sensory sample of an utterance. To assign auditory effects to the proper source, the perceptual organization of speech must capture the complex sound pattern of a phonologically governed vocal source, sensing the spectro‐temporal variation that transcends the simple similarities on which the gestalt‐derived principles rest. It is obvious that gestalt principles couched in auditory dimensions would fail to merge auditory attributes with visual attributes. Because auditory and visual dimensions are simply incommensurate, it is not obvious that any notion of similarity would hold the key to audiovisual combination. The properties that the two senses share – localization in azimuth and range, and temporal pattern – can be violated freely without harming audiovisual combination, and therefore cannot be requisite for multisensory perceptual organization.
The phenomena of multimodal perceptual organization confound straightforward explanation in yet another instructive way. Audiovisual speech perception can be fine under conditions in which the audible and visible components are useless separately for conveying the linguistic properties of the message (Rosen, Fourcin, & Moore, 1981; Remez et al., forthcoming). This phenomenon alone disqualifies current models that assert that phoneme features are derived separately in each modality as long as they are taken to stem from a single event (Magnotti & Beauchamp, 2017). In addition, neither spatial alignment nor temporal alignment of the audible and visible components must be veridical for multimodal perceptual organization to deliver a coherent stream fit to analyze (see Bertelson, Vroomen, & de Gelder, 1997; Conrey & Pisoni, 2003; Munhall et al., 1996). Under such discrepant conditions, audiovisual integration occurs despite the perceiver’s evident awareness of the spatial and temporal misalignment, indicating a divergence in the perceptual organization of events and the perception of speech. In consequence, it is difficult to conceive of an account of such phenomena by means of perceptual organization based on tests of similar sensory details applied separately in each modality. Instead, it is tempting to speculate that an account of perceptual organization of speech can ultimately be characterized in dimensions that are removed from any specific sensory modality, and yet be expressed in parameters that are appropriate to the sensory samples available at any moment.
Conclusion
Perceptual organization is the critical function by which a listener resolves sensory samples into streams specific to worldly objects and events. In the perceptual organization of speech, the auditory correlates of speech are resolved into a coherent stream that is fit to be analyzed for its linguistic and indexical properties. Although many contemporary accounts of speech perception are silent about perceptual organization, it is unlikely that the generic auditory functions of perceptual grouping provide adequate means to find and follow the complex properties of speech. It is possible to propose a rough outline of an adequate account of the perceptual organization of speech by drawing on relevant findings from different research projects spanning a variety of aims. The evidence from these projects suggests that the critical organizational function that operates for speech is fast, unlearned, nonsymbolic, keyed to complex patterns of coordinate sensory variation, indifferent to sensory quality, and dependent on attention whether elicited or exerted. Research on other sources of complex natural sound has the potential to reveal whether these functions are unique to speech or are drawn from a common stock of resources of unimodal and multimodal perceptual organization.
Acknowledgments
In conducting some of the research described here and in writing this chapter, the author is grateful for the sympathetic understanding of Samantha Caballero, Mariah Marrero, Lyndsey Reed, Hannah Seibold, Gabriella Swartz, Philip Rubin, and Michael Studdert‐Kennedy. This work was supported by a grant from the National Science Foundation (SBE 1827361).
REFERENCES
Barker, J., & Cooke, M. (1999). Is the sine‐wave cocktail party worth attending? Speech Communication, 27, 159–174.
Bertelson, P., Vroomen, J., & de Gelder, B. (1997). Auditory–visual interaction in voice localization and in bimodal speech recognition: The effects of desynchronization. In C. Benoît & R. Campbell (Eds.), Proceedings of the Workshop on Audio‐Visual Speech Processing: Cognitive and computational approaches (pp. 97–100). Rhodes, Greece: ESCA.
Billig, A. J., Davis, M. H., & Carlyon, R. P. (2018). Neural decoding of bistable sounds reveals an effect of intention on perceptual organization. Journal of Neuroscience, 38, 2844–2853.
Bregman, A. S. (1990). Auditory scene analysis. Cambridge, MA: MIT Press.
Bregman, A. S., Abramson, J., Doehring, P., & Darwin, C. J. (1985). Spectral integration based on common amplitude modulation. Perception & Psychophysics, 37, 483–493.
Bregman, A. S., Ahad, P. A., & Van Loon, C. (2001). Stream segregation of narrow‐band noise bursts. Perception & Psychophysics, 63, 790–797.
Bregman, A. S., & Campbell, J. (1971). Primary auditory stream segregation and perception of order in rapid sequences of tones. Journal of Experimental Psychology, 89, 244–249.
Bregman, A. S., & Dannenbring, G. L. (1973). The effect of continuity on auditory stream segregation. Perception & Psychophysics, 13, 308–312.
Bregman, A. S., & Dannenbring, G. L. (1977). Auditory continuity and amplitude edges. Canadian Journal of Psychology, 31, 151–158.
Bregman, A. S., & Doehring, P. (1984). Fusion of simultaneous tonal glides: The role of parallelness and simple frequency relations. Perception & Psychophysics, 36, 251–256.
Bregman, A. S., Levitan, R., & Liao, C. (1990). Fusion of auditory components: Effects of the frequency of amplitude modulation. Perception & Psychophysics, 47, 68–73.
Bregman, A. S., & Pinker, S. (1978). Auditory streaming and the building of timbre. Canadian Journal of Psychology, 32, 19–31.
Broadbent, D. E., & Ladefoged, P. (1957). On the fusion of sounds reaching different sense organs. Journal of the Acoustical Society of America, 29, 708–710.
Carlyon, R. P., Cusack, R., Foxton, J. M., & Robertson, I. H. (2001). Effects of attention and unilateral neglect on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance, 27, 115–127.
Carlyon, R. P., Plack, C. J., Fantini, D. A., & Cusack, R. (2003). Crossmodal and non‐sensory influences on auditory streaming. Perception, 32, 1393–1402.
Carrell, T. D., & Opie, J. M. (1992). The effect of amplitude comodulation on auditory object formation in sentence perception. Perception & Psychophysics, 52, 437–445.
Cherry, E. C. (1953). Some experiments on the recognition of speech, with one and two ears. Journal of the Acoustical Society of America, 25, 975–979.
Conrey, B. L., & Pisoni, D. B. (2003). Audiovisual asynchrony detection for speech and nonspeech signals. In J.‐L. Schwartz, F. Berthommier, M.‐A. Cathiard, & D. Sodoyer (Eds.), Proceedings of AVSP 2003: International Conference on Audio‐Visual Speech Processing, St. Jorioz, France, September 4–7, 2003 (pp. 25–30). Retrieved September 24, 2020, from https://www.isca‐speech.org/archive_open/avsp03/av03_025.html.
Cooke,