Sharon Oviatt

The Handbook of Multimodal-Multisensor Interfaces, Volume 1


      Figure 11.7 Based on: N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. 2015. Multi-scale deep learning for gesture detection and localization. In L. Agapito, M. M. Bronstein, and C. Rother, editors, Computer Vision—ECCV 2014 Workshops, volume LNCS 8925, pp. 474–490.

      Figure 11.8 Based on: D. Yu and L. Deng. 2011. Deep learning and its applications to signal and information processing [exploratory DSP]. IEEE Signal Processing Magazine, 28(1): 145–154.

      Figure 11.11 From: G. Pavlakos, S. Theodorakis, V. Pitsikalis, A. Katsamanis, and P. Maragos. 2014. Kinect-based multimodal gesture recognition using a two-pass fusion scheme. In Proceedings of the International Conference on Image Processing, pp. 1495–1499. Copyright © 2014 IEEE. Used with permission.

      Figure 11.11 (video) Courtesy of Stavros Theodorakis. Used with permission.

      Figure 12.1 Based on: G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. 2003. Recent advances in the automatic recognition of audio-visual speech. Proceedings of the IEEE, 91(9): 1306–1326.

      Figure 12.2a From: J. Huang, G. Potamianos, J. Connell, and C. Neti. 2004. Audio-visual speech recognition using an infrared headset. Speech Communication, 44(4): 83–96. Copyright © 2004 Elsevier B.V. Used with permission.

      Figure 12.2b (top) Courtesy of iStock.com/kursatunsal

      Figure 12.2b (middle) Courtesy of iStock.com/Stratol

      Figure 12.2b (bottom) Courtesy of FLIR Systems, Inc.

      Figure 12.6 Based on: G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. 2003. Recent advances in the automatic recognition of audio-visual speech. Proceedings of the IEEE, 91(9): 1306–1326.

      Figure 12.7 Based on: E. Marcheret, G. Potamianos, J. Vopicka, and V. Goel. 2015b. Scattering vs. discrete cosine transform features in visual speech processing. In Proceedings of the International Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing (FAAVSP), pp. 175–180.

      Figure 12.9 Based on: S. Thermos and G. Potamianos. 2016. Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 579–584.

      Figure 12.10 Based on: E. Marcheret, G. Potamianos, J. Vopicka, and V. Goel. 2015b. Scattering vs. discrete cosine transform features in visual speech processing. In Proceedings of the International Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing (FAAVSP), pp. 175–180.

       Introduction: Scope, Trends, and Paradigm Shift in the Field of Computer Interfaces

      During the past decade, multimodal-multisensor interfaces have become the dominant computer interface worldwide. They have proliferated especially rapidly in support of increasingly small mobile devices (for history, see Oviatt and Cohen [2015]). In that regard, they have contributed to the development of smartphones and other mobile devices, as well as their rapidly expanding ecosystem of applications. Business projections estimate that by 2020 the number of smartphones with mobile broadband will increase from two to six billion, resulting in two to three times more smartphones in use than PCs, along with an explosion of related applications [Evans 2014]. At a deeper level, the co-evolution of mobile devices and the multimodal-multisensor interfaces that enable using them is transforming the entire technology industry [Evans 2014].

      One major reason why multimodal-multisensor interfaces have come to dominate on mobile devices is their flexibility. They support users’ ability to select a suitable input mode, or to shift among modalities as needed during the changing physical contexts and demands of continuous mobile use. Beyond that, individual mobile devices like smartphones now require interface support for a large and growing array of applications. In this regard as well, the flexibility of multimodal-multisensor interfaces has successfully supported extremely multifunctional use. These advantages of multimodal interfaces have been well known for over 15 years:

      In the area of mobile computing, multimodal interfaces will promote … the multi-functionality of small devices, in part due to the portability and expressive power of input modes. [Oviatt and Cohen 2000, p. 52]

      Multimodal-multisensor interfaces likewise are ideal for supporting individual differences and universal access among users. With the global expansion of smartphones in developing countries, this aspect of interface flexibility has contributed to the adoption of mobile devices by users representing different native languages, skill levels, ages, and sensory and cognitive impairments. All of the above flexible attributes have stimulated the paradigm shift toward multimodal-multisensor interfaces on computers today, which is often further enhanced by either multimodal output or multimedia output. See the Glossary for defined terms.

      The transition to multimodal-multisensor interfaces has been a particularly seminal one in the design of digital tools. The single keyboard input tool has finally given way to a variety of input options, which can now be matched more aptly to different usage needs. Given human adeptness at developing and using a wide variety of physical tools, it is surprising that keyboard input (a throwback to the typewriter) prevailed for so many decades as the single input option on computers.

      Let’s consider for a moment how our present transition in digital input tools parallels the evolution of multi-component physical tools, which occurred approximately 200,000–400,000 years ago in Homo sapiens. The emergence of multi-component physical tools is considered a major landmark in human cognitive evolution, which co-occurred with a spurt in brain-to-body ratio and shaped our modern cognitive abilities [Epstein 2002, Wynn 2002]. During this earlier renaissance in the design of physical tools made of stone, bone, wood, and skins, the emergence of flexible multi-component tools enabled us to adapt tool design (1) for a variety of specific purposes, (2) to substantially improve tool performance, and (3) to improve ease of use [Masters and Maxwell 2002]. This proliferation in the design of multi-component physical tools led Homo sapiens to experience their differential impact, and to begin to recognize how specific design features contribute to achieving desired effects. For example, a lightweight wooden handle attached to a pointed stone hand-axe could be thrown a long distance for spearing large game. In this regard, the proliferation of tools stimulated a new awareness of the advantages of specific design features, and of the principles required to achieve a particular impact [Commons and Miller 2002].

       Glossary

      Multimedia output refers to system output involving two or more types of information received as feedback by a user during human-computer interaction. It may involve different types of technical media within one modality such as vision (still images, virtual reality, video images), or it may involve multimodal output such as visual, auditory, and tactile feedback to the user.

      Multimodal input involves user input and processing of two or more modalities—such as speech, pen, touch and multi-touch, gestures, gaze, head and body movements, and virtual keyboard. These input modalities may coexist on an interface, but be used either simultaneously or alternately [Oviatt and Cohen 2015]. The input may involve recognition-based technologies (e.g., speech, gesture), simpler discrete input (e.g., keyboard, touch), or sensor-based information (e.g., acceleration, pressure). Some modalities may be capable of expressing semantically rich information and creating new content (e.g., speech, writing, keyboard), while others are limited to making discrete selections and controlling the system display (e.g., touching a URL to open it, pinching gesture to shrink a visual display). These interfaces