both temporally and spatially. Gestalt theory predicts that the common temporal dimension will provide organizational cues for binding these modalities during multimodal communication. That is, modality co-timing will serve to indicate and solidify their relatedness [Oviatt et al. 2003]. Consistent with this prediction, research has confirmed the following:
• Users adopt consistent co-timing of individual signals in their multimodal constructions, and their habitual pattern is resistant to change.
• When system errors or problem difficulty increase, users adapt the co-timing of their individual signals to fortify the whole multimodal construction.
Figure 1.1 Model of average temporal integration pattern for simultaneous and sequential integrators’ typical multimodal constructions. (From Oviatt et al. [2005])
From a neuroscience perspective, the general importance of modality co-timing is highlighted by previous findings showing that greater temporal binding, or synchrony of neuronal oscillations involving different sources of sensory input, is associated with improved task success. For example, correctly recognizing people depends on greater neural binding between multisensory regions that represent their appearance and voice [Hummel and Gerloff 2005].
Studies with over 100 users—children through seniors—have shown that users adopt one of two types of temporal organizational pattern when forming multimodal constructions. They either present simultaneous constructions in which speech and pen signals are overlapped temporally, or sequential ones in which one signal ends before the second begins and there is a lag between them [Xiao et al. 2002, 2003]. Figure 1.1 illustrates these two types of temporal integration pattern. A user’s dominant integration pattern is identifiable almost immediately, typically on the very first multimodal construction during an interaction. Furthermore, her habitual temporal integration pattern remains highly consistent (i.e., 88–93%), and it is resistant to change even after instruction and training.
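The distinction between the two integration patterns can be made concrete with a small sketch. The Python below is illustrative only, not the procedure used in the cited studies. It assumes each multimodal construction is logged as start and end times (in seconds) for the speech and pen signals. The names `Construction`, `overlap_seconds`, `classify`, and `dominant_pattern` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Construction:
    """One multimodal construction; all times in seconds (assumed log format)."""
    speech_start: float
    speech_end: float
    pen_start: float
    pen_end: float


def overlap_seconds(c: Construction) -> float:
    """Positive value: duration of temporal overlap between the two signals.
    Zero or negative value: the signals do not overlap, and the magnitude
    is the lag between the end of one and the start of the other."""
    return min(c.speech_end, c.pen_end) - max(c.speech_start, c.pen_start)


def classify(c: Construction) -> str:
    """Label a construction as simultaneous (overlapped) or sequential."""
    return "simultaneous" if overlap_seconds(c) > 0 else "sequential"


def dominant_pattern(constructions: List[Construction]) -> Tuple[str, float]:
    """Return the user's more frequent pattern and the fraction of
    constructions that follow it (a simple consistency measure)."""
    if not constructions:
        raise ValueError("need at least one construction")
    labels = [classify(c) for c in constructions]
    n_sim = labels.count("simultaneous")
    # Ties are resolved in favor of "simultaneous" (an arbitrary choice).
    dominant = "simultaneous" if 2 * n_sim >= len(labels) else "sequential"
    return dominant, labels.count(dominant) / len(labels)
```

Under these assumptions, a user whose constructions are mostly overlapped would be labeled a simultaneous integrator, and the second return value corresponds to the kind of consistency percentage reported above.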
A second Gestalt law, the principle of area, states that people will tend to group elements to form the smallest visible figure or briefest temporal interval. In the context of the above multimodal construction co-timing patterns, this principle predicts that most people will deliver their signal input simultaneously. Empirical research has confirmed that 70% of people across the lifespan are indeed simultaneous signal integrators, whereas 30% are sequential integrators [Oviatt et al. 2003].
Figure 1.2 Average increased signal overlap for simultaneous integrators in seconds (left), but increased lag for sequential integrators (right), as they handle an increased rate of system errors. (From Oviatt and Cohen [2015])
An important meta-principle underlying all Gestalt tendencies is the creation of a balanced and stable perceptual form that can maintain its equilibrium, just as the interplay of internal and external physical forces shapes an oil drop [Koffka 1935, Kohler 1929]. Gestalt theory states that any factors that threaten a person’s ability to achieve a goal create a state of tension, or disequilibrium. Under these circumstances, it predicts that people will fortify basic organizational phenomena associated with a percept to restore balance [Koffka 1935, Kohler 1929]. As an example, if a person interacts with a multimodal system and it makes a recognition error so she is not understood, then this creates a state of disequilibrium. When this occurs, research has confirmed that users fortify, or further accentuate, their usual pattern of multimodal signal co-timing (i.e., either simultaneous or sequential) by approximately 50%. This phenomenon is known as multimodal hyper-timing [Oviatt et al. 2003]. Figure 1.2 illustrates increased multimodal signal overlap in simultaneous integrators, but increased signal lag in sequential integrators, as they experience more system errors. Multimodal hyper-timing has also been demonstrated in users’ constructions when problem difficulty level increases [Oviatt et al. 2003].
From a Gestalt viewpoint, this behavior aims to re-establish equilibrium by fortifying multimodal signal co-timing, the basic organizational principle of such constructions, which results in a more coherent multimodal percept under duress. This multimodal hyper-timing contributes to hyper-clear communication that increases the speed and accuracy of perceptual processing by a listener. This manifestation of hyper-clear multimodal communication is qualitatively distinct from the hyper-clear adaptations observed in unimodal components. For example, in a unimodal spoken construction users increase their speech signal’s total length and degree of articulatory control as part of hyperarticulation when system errors occur. However, this unimodal adaptation diminishes or disappears altogether when speech is part of a multimodal construction [Oviatt et al. 2003].
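One way to quantify the accentuation described above is as a relative change in the mean magnitude of overlap (for simultaneous integrators) or lag (for sequential integrators) between a baseline condition and a high-error condition. The sketch below is an assumption about how such a measure could be computed, not the analysis reported in the cited work. The function name `accentuation` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def accentuation(baseline: Sequence[float], under_errors: Sequence[float]) -> float:
    """Relative change in mean co-timing magnitude.

    Each input holds per-construction overlap values in seconds, where
    negative values denote lag. Magnitudes are used so that the same
    measure applies to both simultaneous and sequential integrators.
    A return value of 0.5 means the pattern was accentuated by 50%.
    """
    base = mean(abs(x) for x in baseline)
    if base == 0:
        raise ValueError("baseline magnitude is zero; ratio is undefined")
    stressed = mean(abs(x) for x in under_errors)
    return (stressed - base) / base
```

A positive result for a simultaneous integrator indicates more overlap under errors, and for a sequential integrator it indicates a longer lag, which matches the direction of change shown in Figure 1.2.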
The Gestalt law of symmetry states that people have a tendency to perceive symmetrical elements as part of the same whole. They view objects as symmetrical, formed around a center point. During multimodal human-computer interaction involving speech and pen constructions, Gestalt theory would predict that more symmetrical organization entails closer temporal correspondence or co-timing between the two signal pieces, and a closer matching of their proportional length. This would be especially evident in significantly increased co-timing of the component signals’ onsets and offsets. Research on multimodal interaction involving speech and pen constructions has confirmed that users increase the co-timing of their signal onsets and offsets during disequilibrium, such as when system errors increase [Oviatt et al. 2003].
In summary, the Gestalt principles outlined above have provided a valuable framework for understanding how people perceive and organize multisensory information, as well as multimodal input to a computer interface. These principles have been used to establish new requirements for multimodal speech and pen interface design [Oviatt et al. 2003]. They have also supported computational analysis of other types of multimodal systems, for example those involving pen and image content [Saund et al. 2003].
One implication of these results is that time-sensitive multimodal systems need to accurately model users’ multimodal integration patterns, including adaptations in signal timing that occur during different circumstances. In particular, user-adaptive multimodal processing is a fertile direction for system development. One example is the development of new strategies for adapting temporal thresholds in time-sensitive multimodal architectures during the fusion process, which could yield substantial improvements in system response speed, robustness, and overall usability [Huang and Oviatt 2005, Huang et al. 2006].
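To illustrate what adapting a temporal threshold might look like, the sketch below sets how long a fusion component waits for a second signal after the first one ends, based on lags already observed for the current user. This is a simplified heuristic written for illustration. It is not the approach of Huang and Oviatt [2005] or Huang et al. [2006]. The function name, the 4-second default, the safety margin, and the minimum sample count are all assumed values.

```python
from typing import Sequence


def adaptive_wait_threshold(
    observed_lags: Sequence[float],
    default: float = 4.0,    # assumed fixed threshold for unknown users (s)
    margin: float = 1.2,     # assumed safety factor on the longest lag seen
    min_samples: int = 3,    # assumed number of constructions needed to adapt
) -> float:
    """Return how long (in seconds) to wait for a second signal.

    observed_lags holds, per past construction, the time from the end of
    the first signal to the start of the second. Values at or below zero
    mean the signals overlapped (simultaneous integration).
    """
    if len(observed_lags) < min_samples:
        return default              # not enough evidence; stay conservative
    longest = max(observed_lags)
    if longest <= 0:
        return 0.0                  # consistently overlapped; no wait needed
    # Sequential integrator: wait slightly longer than the longest lag
    # seen so far, but never longer than the fixed default.
    return min(default, longest * margin)
```

In this sketch, a user who consistently overlaps signals gets an immediate response, while a sequential integrator gets a wait tuned to her own lags. Because hyper-timing lengthens lags when errors increase, a practical version would also need to update the threshold as conditions change.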
An additional implication is that Gestalt principles, and the multisensory research findings that have further elaborated them, potentially can provide useful guidance for designing “well integrated” multimodal interfaces [Reeves et al. 2004]. Researchers have long been interested in defining what it means to be a well-integrated multimodal interface, including the circumstances under which super-additivity effects can be expected rather than interference between modalities. One particularly salient strategy is to integrate maximally complementary input modes in a multimodal interface, or ones that produce a highly synergistic blend in which the strengths of each mode can be capitalized upon and used to overcome weaknesses in the other [Cohen et al. 1989, Oviatt and Cohen 2015]. Complementarity can aim to minimize variance in estimations of individual signal interpretation, which maximizes the accuracy of the final multimodal interpretation, as discussed previously. Alternatively, it can aim to expand the functional utility of an interface for a user.
Further research could leverage theory and multidisciplinary research findings to determine what it means to be a well-integrated multimodal interface beyond simply selecting the modalities for inclusion. For further discussion of this topic, see Section 8.4, “Principles for Strategizing Multimodal Integration,” in Oviatt and Cohen [2015].
1.2 Working Memory Theory: Performance Advantages of Distributing Multimodal Processing
In comparison with Gestalt theory, a major theme of Working Memory theory is that attention and working