6 and onward are more focused on the practical aspects of CNNs. Specifically, Chapter 6 presents state-of-the-art CNN architectures that have demonstrated excellent performance on a number of vision tasks. It also provides a comparative analysis and discusses their relative pros and cons. Chapter 7 goes into further depth regarding applications of CNNs to core vision problems. For each task, it discusses a set of representative works using CNNs and reports their key ingredients for success. Chapter 8 covers popular software libraries for deep learning such as Theano, TensorFlow, Caffe, and Torch. Finally, Chapter 9 presents open problems and challenges for deep learning along with a succinct summary of the book.
The purpose of the book is not to provide a literature survey for the applications of CNNs in computer vision. Rather, it succinctly covers key concepts and provides a bird’s eye view of recent state-of-the-art models designed for practical problems in computer vision.
Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, and Mohammed Bennamoun January 2018
Acknowledgments
We would like to thank Gerard Medioni and Sven Dickinson, the editors of this Synthesis Lectures on Computer Vision series, for giving us an opportunity to contribute to this series. We greatly appreciate the help and support of Diane Cerra, Executive Editor at Morgan & Claypool, who managed the complete book preparation process. We are indebted to the colleagues, students, collaborators, and co-authors we have worked with during our careers, who contributed to the development of our interest in this subject. We are also deeply thankful to the wider research community, whose work has led to major advancements in computer vision and machine learning, a part of which is covered in this book. More importantly, we want to express our gratitude toward the people who allowed us to use their figures or tables in some portions of this book. This book has greatly benefited from the constructive comments and appreciation of the reviewers, which helped us improve the presented content. Finally, this effort would not have been possible without the help and support of our families.
We would like to acknowledge support from the Australian Research Council (ARC), whose funding and support were crucial to some of the contents of this book.
Salman Khan, Hossein Rahmani, Syed Afaq Ali Shah, and Mohammed Bennamoun January 2018
CHAPTER 1
Introduction
Computer vision and machine learning have together played decisive roles in the development of a variety of image-based applications within the last decade (e.g., various services provided by Google, Facebook, Microsoft, and Snapchat). During this time, vision-based technology has been transformed from a mere sensing modality into intelligent computing systems that can understand the real world. Thus, knowledge of computer vision and machine learning (e.g., deep learning) is an important skill that is required in many modern innovative businesses and is likely to become even more important in the near future.
1.1 WHAT IS COMPUTER VISION?
Humans use their eyes and their brains to see and understand the 3D world around them. For example, given an image as shown in Fig. 1.1a, humans can easily see a “cat” in the image and thus, categorize the image (classification task); localize the cat in the image (classification plus localization task as shown in Fig. 1.1b); localize and label all objects that are present in the image (object detection task as shown in Fig. 1.1c); and segment the individual objects that are present in the image (instance segmentation task as shown in Fig. 1.1d). Computer vision is the science that aims to give a similar, if not better, capability to computers. More precisely, computer vision seeks to develop methods which are able to replicate one of the most amazing capabilities of the human visual system, i.e., inferring characteristics of the 3D real world purely using the light reflected to the eyes from various objects.
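The four tasks listed above differ mainly in the form of their output. The following is a minimal sketch of those output shapes, assuming pixel-coordinate boxes and binary masks stored as nested lists. All type and function names here (`Classification`, `Localization`, `Detection`, `InstanceSegmentation`, `box_from_mask`) are hypothetical and chosen only for illustration; they do not come from any library.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration only; not part of any library.
BoundingBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Mask = List[List[int]]                   # binary mask: 1 where the object is


@dataclass
class Classification:
    label: str            # a single label for the whole image, e.g., "cat"


@dataclass
class Localization:
    label: str
    box: BoundingBox      # one box around the main object


@dataclass
class Detection:
    objects: List[Localization]  # one (label, box) pair per object in the image


@dataclass
class InstanceSegmentation:
    labels: List[str]
    masks: List[Mask]     # one mask per object, even for objects of the same type


def box_from_mask(mask: Mask) -> BoundingBox:
    """Return the tightest box enclosing the nonzero pixels of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask is empty")
    return (min(cols), min(rows), max(cols), max(rows))
```

Because a box can be derived from a mask but not the reverse, instance segmentation is the most detailed of the four outputs in this sketch.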
However, recovering and understanding the 3D structure of the world from two-dimensional images captured by cameras is a challenging task. Researchers in computer vision have been developing mathematical techniques to recover the three-dimensional shape and appearance of objects and scenes from images. For example, given a large enough set of images of an object captured from a variety of views (Fig. 1.2), computer vision algorithms can reconstruct an accurate dense 3D surface model of the object using dense correspondences across multiple views. Despite all of these advances, understanding images at the same level as humans remains challenging.
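To give a sense of how correspondences between views carry 3D information, the following is a minimal sketch of the simplest case: two rectified cameras placed side by side, where the depth of a matched point follows from its horizontal shift (disparity) as Z = f B / d. This is a deliberate simplification of the multi-view reconstruction described above, not the method behind Fig. 1.2. The function name and the parameters `focal_px` (focal length in pixels) and `baseline_m` (camera separation in meters) are assumptions made for illustration.

```python
def depth_from_disparity(x_left: float, x_right: float,
                         focal_px: float, baseline_m: float) -> float:
    """Depth (in meters) of a point matched across two rectified views.

    Assumes an idealized rectified stereo pair: both cameras share the same
    focal length and differ only by a horizontal shift of `baseline_m`.
    """
    disparity = x_left - x_right  # horizontal shift of the match, in pixels
    if disparity <= 0:
        raise ValueError("a valid match must have positive disparity")
    return focal_px * baseline_m / disparity


# A point seen at x = 420 px in the left image and x = 400 px in the right one.
depth = depth_from_disparity(420.0, 400.0, focal_px=800.0, baseline_m=0.1)
```

Nearby points shift more between the two views than distant ones, so a dense set of such matches yields a dense depth map under these assumptions.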
1.1.1 APPLICATIONS
Due to the significant progress in the field of computer vision and visual sensor technology, computer vision techniques are being used today in a wide variety of real-world applications, such as intelligent human-computer interaction, robotics, and multimedia. It is also expected that the next generation of computers could even understand human actions and languages at the same level as humans, carry out some missions on behalf of humans, and respond to human commands in a smart way.
Figure 1.1: What do we want computers to do with the image data? To look at the image and perform classification, classification plus localization (i.e., to find a bounding box around the main object (CAT) in the image and label it), to localize all objects that are present in the image (CAT, DOG, DUCK) and to label them, or perform semantic instance segmentation, i.e., the segmentation of the individual objects within a scene, even if they are of the same type.
Figure 1.2: Given a set of images of an object (e.g., upper human body) captured from six different viewpoints, a dense 3D model of the object can be reconstructed using computer vision algorithms.
Human-computer Interaction
Nowadays, video cameras are widely used for human-computer interaction and in the entertainment industry. For instance, hand gestures are used in sign language to communicate, transfer messages in noisy environments, and interact with computer games. Video cameras provide a natural and intuitive way of human communication with a device. Therefore, one of the most important aspects for these cameras is the recognition of gestures and short actions from videos.
Robotics
Integrating computer vision technologies with high-performance sensors and cleverly designed hardware has given rise to a new generation of robots that can work alongside humans and perform many different tasks in unpredictable environments. For example, an advanced humanoid robot can jump, talk, run, or walk up stairs in much the same way as a human does. It can also recognize and interact with people. In general, an advanced humanoid robot can perform various activities that are mere reflexes for humans and do not require a high intellectual effort.
Multimedia
Computer vision technology plays a key role in multimedia applications. These have led to a massive research effort in the development of computer vision algorithms for processing, analyzing, and interpreting multimedia data. For example, given a video, one can ask “What does this video mean?”, which involves a quite challenging task of image/video understanding and summarization. As another example, given a clip of video, computers could search the Internet and get millions