Sharon Oviatt

The Handbook of Multimodal-Multisensor Interfaces, Volume 1



will differ qualitatively depending on whether a system is multimodal or unimodal, which is consistent with the Gestalt principle of totality (see separate entries on multimodal hypertiming and hyperarticulation). During disequilibrium, behavioral adaptation observed in the user aims to fortify the organizational principles described in Gestalt laws in order to restore a coherent whole percept.

      Extraneous cognitive load refers to the level of working memory load that a person experiences due to the properties of materials or computer interfaces they are using. High levels of extraneous cognitive load can undermine a user’s primary task performance. Extraneous load is distinguished from (1) intrinsic cognitive load, or the inherent difficulty level and related working memory load associated with a user’s primary task, and (2) germane cognitive load, or the level of a student’s effort and activity compatible with mastering new domain content, which may either be supported or undermined by interface design (e.g., due to inappropriate automation).

      Hyperarticulation involves a stylized and clarified adaptation of a user’s typical unimodal speech, which she will shift into during disequilibrium, for example when accommodating “at risk” listeners (e.g., hearing impaired), adverse communication environments (e.g., noisy), or interactions involving frequent miscommunication (e.g., error-prone spoken language systems). A user’s hyperarticulate speech to an error-prone speech system primarily involves a lengthier and more clearly articulated speech signal, as summarized in the CHAM model [Oviatt et al. 1998]. This type of hyper-clear unimodal speech adaptation is distinct from that observed when speech is combined multimodally, which involves multimodal hypertiming. In general, speakers hyperarticulate whenever they expect or experience a communication failure with their listener, which occurs during both interpersonal and human-computer exchanges. When users interact with spoken dialogue systems, hyperarticulation is a major cause of system recognition failure, although it can be avoided by designing a multimodal interface.

      Limited resource theories focus on cognitive constraints, especially ones involving attention and working memory, that can act as bottlenecks limiting human processing. Examples of limited resource theories include Working Memory theory, Multiple Resource theory, and Cognitive Load theory. These and similar theories address how people adaptively conserve energy and mental resources, while striving to optimize performance on a task. They have been well supported by both behavioral and neuroscience data. Currently, Working Memory theory is most actively being researched and refined.

      Maximum-likelihood estimation (MLE) principle applies Bayes rule during multisensory fusion to determine the variance associated with individual input signals, asymmetry in signal variance, and the degree to which one sensory signal dominates another in terms of the final multimodal percept. The MLE model also estimates variance associated with the combined multimodal percept, and the magnitude of any super-additivity observed in the final multimodal percept (see separate entry on super-additivity).

      Multimodal hypertiming involves adaptation of a user’s typical multimodal construction (e.g., when using speech and writing), which she will shift into during disequilibrium. For example, when interacting with an error-prone multimodal system, a user’s input will adapt to accentuate or fortify their habitual pattern of signal co-timing. Since there is a bimodal distribution of users who demonstrate either a simultaneous or a sequential pattern of multimodal signal co-timing, this means that (1) simultaneous integrators, whose input signals overlap temporally, will increase their total signal overlap, but (2) sequential integrators, who complete one signal piece before starting another with a lag in between, will instead increase the total lag between signals. This multimodal hypertiming represents a form of entrenchment, or hyper-clear communication, that is distinct from that observed when communicating unimodally (see separate entry on hyperarticulation).

      Perception-action dynamic theories assert that perception, action, and consciousness are dynamically interrelated. They provide a holistic systems-level view of interaction between humans and their environment, including feedback processes as part of a dynamic loop. Examples of anti-reductionistic perception-action dynamic theories include Activity meta-theories, Embodied Cognition theory, Communication Accommodation theory, and Affordance theory. These theories claim that action may be either physical or communicative. In some cases, such as Communication Accommodation theory, they involve socially-situated theories. Perception-action dynamic theories have been well supported by both behavioral and neuroscience data, including research on mirror and echo neurons. Currently, Embodied Cognition theory is most actively being researched and refined, often in the context of human learning or neuroscience research. It asserts that representations involve activating neural processes that recreate a related action-perception experiential loop, which is based on multisensory perceptual and multimodal motor neural circuits in the brain [Nakamura et al. 2012]. During this feedback loop, perception of an action (e.g., writing a letter shape) primes motor neurons (e.g., corresponding finger movements) in the observer’s brain, which facilitates related comprehension (e.g., letter recognition and reading).

      Super-additivity refers to multisensory enhancement of the neural firing pattern when two sensory signals (e.g., auditory and visual) are both activated during a perceptual event. This can produce a total response larger than the sum of the two sources of modality-specific input, which improves the reliability of the fused signal. Closer spatial or temporal proximity can increase super-additivity, and the magnitude of super-additivity increases in adverse conditions (e.g., noise, darkness). The maximum-likelihood estimation (MLE) principle has been applied to estimate the degree and pattern of super-additivity. One objective of multimodal system design is to support maximum super-additivity.
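
      The distinction drawn in the multimodal hypertiming entry above, between simultaneous integrators (overlapping signals) and sequential integrators (a lag between signals), can be illustrated with a minimal sketch. The `Signal` class, the `co_timing` function, and the timestamps are hypothetical names and values introduced only for illustration; they are not taken from the handbook or from any particular system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Signal:
    """One input signal (e.g., speech or pen); times are hypothetical, in seconds."""
    start: float
    end: float


def co_timing(first: Signal, second: Signal) -> Tuple[str, float]:
    """Classify a two-signal construction and measure its overlap or lag.

    Returns ("simultaneous", overlap) when the signals overlap in time,
    otherwise ("sequential", lag), where lag is the gap between the end of
    the earlier signal and the start of the later one.
    """
    # Order by onset so the result does not depend on argument order.
    early, late = sorted((first, second), key=lambda s: s.start)
    overlap = min(early.end, late.end) - late.start
    if overlap > 0:
        return "simultaneous", overlap
    return "sequential", -overlap


# Illustrative values only: speech from 0.0 to 2.0 s, pen from 1.0 to 3.0 s.
pattern, amount = co_timing(Signal(0.0, 2.0), Signal(1.0, 3.0))
print(pattern, amount)
```

      Under hypertiming as described above, one would expect the overlap of a simultaneous integrator, or the lag of a sequential integrator, to grow when the same user interacts with an error-prone system; comparing the second return value across conditions is one simple way to check this.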

      Research on multisensory integration has clarified that there are asymmetries during fusion in what type of signal input dominates a perceptual interpretation, and the degree to which it is weighted more heavily. In the temporal ventriloquism effect, asynchronous auditory and visual input can be fused by effectively binding an earlier visual stimulus into temporal alignment with a subsequent auditory one, as long as they occur within a given window of time [Morein-Zamir et al. 2003]. In this case, visual perception is influenced by auditory cues. In contrast, in the spatial ventriloquism effect the perceived location of a sound can be shifted toward a corresponding visual cue [Bertelson and deGelder 2004]. The maximum-likelihood estimation (MLE) principle of multisensory fusion, based on Bayes rule, has been used to estimate the degree to which one modality dominates another during signal fusion. This principle describes how signals are integrated in the brain to minimize variance in their interpretation, which maximizes the accuracy of the final multimodal interpretation. For example, during visual-haptic fusion, visual dominance occurs when the variance associated with visual estimation is lower than that for haptic estimation [Ernst and Banks 2002]. For further details, see Section 1.3.
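
      The variance-minimizing fusion described above can be sketched in a few lines. This is a minimal illustration that assumes the two sensory estimates carry independent Gaussian noise, in which case each signal is weighted in inverse proportion to its variance. The function name `mle_fuse` and the numeric values are hypothetical and are not data from the cited studies.

```python
def mle_fuse(est_a, var_a, est_b, var_b):
    """Fuse two independent estimates by inverse-variance weighting.

    Returns (fused_estimate, fused_variance, weight_a, weight_b).
    Assumes independent Gaussian noise on each signal.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    total = var_a + var_b
    # The lower-variance (more reliable) signal receives the larger weight.
    weight_a = var_b / total
    weight_b = var_a / total
    fused_estimate = weight_a * est_a + weight_b * est_b
    # The fused variance is never larger than either input variance.
    fused_variance = (var_a * var_b) / total
    return fused_estimate, fused_variance, weight_a, weight_b


# Illustrative values only: a visual size estimate of 55 mm (variance 1)
# and a haptic size estimate of 60 mm (variance 4).
fused, variance, w_visual, w_haptic = mle_fuse(55.0, 1.0, 60.0, 4.0)
print(fused, variance, w_visual, w_haptic)
```

      With these made-up inputs the visual signal receives the larger weight because its variance is lower, which corresponds to the visual dominance case described above. Raising the visual variance above the haptic variance reverses the weighting.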

      Multisensory integration research also has elaborated our understanding of how spatial and temporal proximity influence the salience of a multisensory percept. Neurons in the deep superior colliculus now are well known to exhibit multisensory enhancement in their firing patterns, or super-additivity [Anastasio and Patton 2004]. This can produce responses larger than the sum of the two modality-specific sources of input [Bernstein and Benoit 1996, Anastasio and Patton 2004]. Closer proximity of related signals can produce greater super-additivity. This phenomenon functions to improve the speed and accuracy of human responsiveness to objects and events, especially in adverse conditions such as noise or darkness [Calvert et al. 2004, Oviatt 2000, Oviatt 2012]. From an evolutionary perspective, these behavioral adaptations have directly supported human survival in many situations.
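
      The definition of super-additivity used above, a combined response larger than the sum of the two modality-specific responses, can be expressed as a simple check. The sketch below is illustrative only: the function names and the firing rates are hypothetical, not measurements from the cited studies.

```python
def additivity_ratio(resp_a, resp_b, resp_combined):
    """Ratio of the combined response to the sum of the unimodal responses.

    A ratio above 1.0 indicates a super-additive response.
    """
    unimodal_sum = resp_a + resp_b
    if unimodal_sum <= 0:
        raise ValueError("sum of unimodal responses must be positive")
    return resp_combined / unimodal_sum


def is_super_additive(resp_a, resp_b, resp_combined):
    """True when the combined response exceeds the sum of the unimodal responses."""
    return resp_combined > resp_a + resp_b


# Illustrative firing rates only (spikes per second): auditory alone,
# visual alone, and both presented together.
auditory, visual, combined = 4.0, 6.0, 15.0
print(is_super_additive(auditory, visual, combined))
print(additivity_ratio(auditory, visual, combined))
```

      A combined response equal to the unimodal sum would give a ratio of exactly 1.0 and would count as additive rather than super-additive under this check.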

      In addition to promoting better understanding of multisensory perception, Gestalt theoretical principles have advanced research on users’ production of multimodal constructions during human-computer interaction. For example, studies of users’ multimodal spoken and written constructions confirm that integrated multimodal constructions are qualitatively distinct from their unimodal parts. In addition, Gestalt principles accurately predict the organizational cues that bind this type of multimodal construction [Oviatt et al. 2003]. In a pen-voice multimodal interface, a user’s speech input is an acoustic modality that is structured temporally. In