linguistic constructions involving spatial descriptions. Their approach combines computational modeling of communication, informed by substantial cognitive science and linguistics research, with data-driven processing. They provide a detailed walkthrough of their multimodal speech and gesture production model, which is based on activation spreading within dynamically shaped multimodal memories. They argue that semantic coordination across modalities arises from the interplay of modality-specific representations for speech and gestures under given cognitive resources. Results from preliminary simulation experiments predict how likely multimodal constructions are to involve gestures that are redundant vs. complementary with co-occurring speech, a likelihood that shifts as cognitive resources become less vs. more constrained, respectively. Kopp and Bergmann’s chapter provides a thoughtful discussion of the role and value of cognitive modeling in developing multimodal systems, as well as the specific use of multimodal speech and gesture production models for developing applications like virtual characters and social robotics.
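Their specific model is not reproduced here, but the general mechanism of activation spreading over linked multimodal representations can be illustrated with a minimal sketch. The graph of concept, word, and gesture nodes, the link weights, and the decay parameter below are all hypothetical, chosen only to show how activating a concept can raise the activation of both a verbal and a gestural representation.

```python
# Minimal, generic sketch of activation spreading over a small graph of
# linked representations. The graph, weights, and decay value are
# hypothetical illustrations, not Kopp and Bergmann's actual model.

GRAPH = {
    "concept:LANDMARK": {"word:'church'": 0.8, "gesture:POINT": 0.6},
    "word:'church'": {"concept:LANDMARK": 0.4},
    "gesture:POINT": {"concept:LANDMARK": 0.4},
}

def spread_activation(seed, steps=3, decay=0.5):
    """Propagate activation from seed nodes along weighted links."""
    activation = dict(seed)
    for _ in range(steps):
        updates = {}
        for node, act in activation.items():
            for neighbor, weight in GRAPH.get(node, {}).items():
                updates[neighbor] = updates.get(neighbor, 0.0) + act * weight * decay
        for node, extra in updates.items():
            activation[node] = activation.get(node, 0.0) + extra
    return activation

# Activating a spatial concept raises the activation of both the verbal
# and the gestural representation linked to it.
print(spread_activation({"concept:LANDMARK": 1.0}))
```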
Multimodal interfaces are well known to be the preferred direction for supporting individual differences and universal access to computing. In Chapter 8, Munteanu and Salah challenge us to understand the needs of one of the most rapidly growing and underserved populations in the world—seniors over 65 years. As a starting point, this chapter summarizes Maslow’s hierarchy of human needs in order to understand and design valuable technology for seniors. This includes designing for their basic physical needs (e.g., self-feeding, medications), safety and social-emotional needs (e.g., preventing falls, physical isolation, and loneliness), and esteem and self-actualization needs (e.g., independence, growth, and mastery experiences). Among the challenges of designing for this group are the substantial individual differences they exhibit (i.e., from healthy and mobile to physically and cognitively disabled), and their frequently changing status as they age. Munteanu and Salah describe examples of especially active application areas, such as socially assistive robotics, communication technologies for connecting and sharing activities within families, technologies for accessing digital information, personal assistant technologies, and ambient assistive living smart-home technologies. They highlight design methods and strategies that are especially valuable for this population, such as participatory design, adaptive multimodal interfaces that can accommodate seniors’ individual differences, balanced multimodal-multisensor interfaces that preserve seniors’ sense of control and dignity (i.e., rather than simply monitoring them), and easy-to-use interfaces based on rudimentary speech, touch/haptics, and activity tracking input.
Common Modality Combinations
Several chapters discussed above have already illustrated common modality combinations in multimodal-multisensor interfaces—for example, commercially available touch and pen input (Chapter 4), and multimodal output incorporating haptic and non-speech audio (Chapter 7) and speech and manual gesturing (Chapter 6). The chapters that follow examine common modality combinations in greater technical detail, with an emphasis on four different types of speech-centric multimodal input interfaces—incorporating user gaze, pen input, gestures, and visible speech movements. These additional modalities differ significantly in the sensors used, the approach to information extraction and representation, the fusion and integration of the second input modality with the speech signal, and the specific application scenarios. These chapters address the main challenges posed by each of these modality combinations, and the most prevalent and successful techniques for building related systems.
In Chapter 9, Qvarfordt outlines the properties of human gaze, its importance in human communication, methods for capturing and processing it automatically, and its incorporation in multimodal interfaces. In particular, she reviews basic human eye movements, and discusses how eye-tracking devices capture gaze information. This discussion emphasizes gaze signal processing and visualization, but also practical limitations of the technology. She then provides an overview of the role that gaze plays when combined with other modalities such as pointing, touch, and spoken conversation during interaction and communication. As a concrete example, this discussion details a study on the utility of gaze in multi-party conversation over shared visual information. In the final section of this chapter, Qvarfordt discusses practical systems that include gaze. She presents a design-space taxonomy of gaze-informed multimodal systems, with two axes representing gaze as active vs. passive input, and in stationary vs. mobile usage scenarios. A rich overview is then presented of gaze-based multimodal systems for selection, detecting user activity and interest, supporting conversational interaction, and other applications.
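As a concrete illustration of the kind of gaze signal processing discussed, the sketch below groups raw gaze samples into fixations using a simple dispersion-threshold approach. The sample format and the dispersion and duration thresholds are assumptions made for illustration, not values taken from the chapter.

```python
# Minimal dispersion-threshold (I-DT style) fixation detector.
# Gaze samples are assumed to be (timestamp_ms, x, y) tuples; the
# dispersion and duration thresholds below are illustrative only.

def detect_fixations(samples, max_dispersion=30.0, min_duration_ms=100.0):
    """Group consecutive gaze samples into fixations."""
    fixations, window = [], []
    for sample in samples:
        window.append(sample)
        xs = [s[1] for s in window]
        ys = [s[2] for s in window]
        dispersion = (max(xs) - min(xs)) + (max(ys) - min(ys))
        if dispersion > max_dispersion:
            # Window too spread out: close any fixation that lasted long enough.
            closed = window[:-1]
            if closed and closed[-1][0] - closed[0][0] >= min_duration_ms:
                fixations.append((closed[0][0], closed[-1][0],
                                  sum(s[1] for s in closed) / len(closed),
                                  sum(s[2] for s in closed) / len(closed)))
            window = [sample]
    if window and window[-1][0] - window[0][0] >= min_duration_ms:
        fixations.append((window[0][0], window[-1][0],
                          sum(s[1] for s in window) / len(window),
                          sum(s[2] for s in window) / len(window)))
    return fixations  # list of (start_ms, end_ms, mean_x, mean_y)
```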
In Chapter 10, Cohen and Oviatt motivate why writing provides a synergistic combination with spoken language. Based on the complementarity principle of multimodal interface design, these input modes have opposite communication strengths and weaknesses: whereas spoken language excels at describing objects, time, and events in the past or future, writing is uniquely able to render precise spatial information including diagrams, symbols, and information in a specific spatial context. Since error patterns of the component recognizers also differ, multimodal systems that combine speech and writing can support mutual disambiguation that yields improved robustness and stability of performance. In this chapter, the authors describe the main multimodal system components, language processing techniques, and architectural approaches for successfully processing users’ speech and writing. In addition, examples are provided of both research and commercially deployed multimodal systems, with rich illustrations of the scenarios they are capable of handling. Finally, the performance characteristics of multimodal pen/voice systems are compared with their unimodal counterparts.
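The mechanics of mutual disambiguation can be illustrated with a toy late-fusion step that rescores the n-best lists returned by a speech recognizer and a handwriting recognizer. The hypothesis strings, probabilities, and stream weight below are invented for illustration and do not come from the systems described in the chapter.

```python
import math

# Toy late-fusion sketch of mutual disambiguation: each recognizer returns
# an n-best list of (hypothesis, probability); a joint score favors
# interpretations supported by both modalities. All values are invented.

def fuse_nbest(speech_nbest, writing_nbest, speech_weight=0.5):
    """Combine two n-best lists by weighted log-linear scoring."""
    fused = {}
    for hyp, p_speech in speech_nbest:
        for hyp_w, p_write in writing_nbest:
            if hyp == hyp_w:  # only jointly supported interpretations survive
                fused[hyp] = (speech_weight * math.log(p_speech)
                              + (1.0 - speech_weight) * math.log(p_write))
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

speech_nbest = [("main street", 0.50), ("maine street", 0.45)]
writing_nbest = [("maine street", 0.60), ("main street", 0.35)]

# Speech alone would pick "main street"; combining evidence from writing
# flips the top-ranked interpretation.
print(fuse_nbest(speech_nbest, writing_nbest))
```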
The remaining two chapters present more of a signal-processing perspective on multimodal system development, a topic that will be elaborated in greater detail in [Oviatt et al. 2017a]. In Chapter 11, Katsamanis et al. discuss the ubiquity of multimodal speech and gesturing, which co-occur in approximately 90% of communications across cultures. They describe different types of gestures, segmental phases in their formation, and the function of gestures during spoken communication. In the second part of the chapter, the authors shift to presenting an overview of state-of-the-art multimodal gesture and speech recognition, in particular temporal modeling and detailed architectures for fusing these loosely-synchronized modalities (e.g., Hidden Markov Models, Deep Neural Nets). To facilitate readers’ practical understanding of how multimodal speech and gesture systems function and perform, the authors present a walk-through example of their recently developed system, including its methods for capturing data on the bimodal input streams (i.e., using RGB-D sensors like Kinect), feature extraction (i.e., based on skeletal, hand shape, and audio features), and two-pass multimodal fusion. They provide a detailed illustration of the system’s multimodal recognition of a word-gesture sequence, which shows how errors during audio-only and single-pass processing can be overcome during two-pass multimodal fusion. The comparative performance accuracy of this multimodal system is also summarized, based on the well-known ChaLearn dataset and community challenge.
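A highly simplified view of score-level fusion for loosely synchronized streams is sketched below: each modality scores the same candidate labels (for example, per-class HMM log-likelihoods), and a stream weight controls each modality's influence. The labels, scores, and weight are hypothetical, and the sketch does not reproduce the two-pass architecture described in the chapter.

```python
# Simplified score-level fusion of loosely synchronized streams: each
# modality scores the same candidate labels (e.g., per-class HMM
# log-likelihoods), and a stream weight controls each modality's
# influence. All numbers below are invented for illustration.

def fuse_streams(audio_scores, gesture_scores, audio_weight=0.6):
    """Return candidates ranked by weighted sum of per-stream log scores."""
    fused = {
        label: audio_weight * audio_scores[label]
               + (1.0 - audio_weight) * gesture_scores[label]
        for label in audio_scores
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Audio alone confuses "OK" with "stop" in noise; the gesture stream
# resolves the ambiguity after fusion.
audio_scores = {"OK": -11.2, "stop": -10.9, "wave": -15.4}
gesture_scores = {"OK": -8.1, "stop": -13.7, "wave": -9.0}
print(fuse_streams(audio_scores, gesture_scores))
```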
In Chapter 12, Potamianos et al. focus on systems that incorporate visual speech information from the speaker’s mouth region into the traditional speech-processing pipeline. They motivate this approach based on the inherently audiovisual nature of human speech production and perception, and also by providing an overview of typical scenarios in which these modalities complement one another to enhance robust recognition of articulated speech (e.g., during noisy conditions). In the main part of their chapter, the authors offer a detailed review of the basic sensory devices, corpora, and techniques used to develop bimodal speech recognition systems. They specifically discuss visual feature extraction (e.g., based on facial landmarks, regions of interest), and audio-visual fusion that leverages the tight coupling between visible and audible speech. Since many of the algorithmic approaches presented are not limited to automatic speech recognition, the authors provide an overview of additional