when one gets tired of watching a long movie, computers would automatically summarize the movie for them.
1.1.2 IMAGE PROCESSING VS. COMPUTER VISION
Image processing can be considered as a preprocessing step for computer vision. More precisely, the goal of image processing is to extract fundamental image primitives, such as edges and corners, using operations such as filtering and morphological operations. These image primitives are usually represented as images. For example, in order to perform semantic image segmentation (Fig. 1.1), which is a computer vision task, one might need to apply some filtering to the image (an image processing task) during that process.
Unlike image processing, which is mainly focused on processing raw images without extracting any knowledge about their content, computer vision produces semantic descriptions of images. Based on the abstraction level of the output information, computer vision tasks can be divided into three different categories, namely low-level, mid-level, and high-level vision.
Low-level Vision
Based on the extracted image primitives, low-level vision tasks can be performed on images/videos. Image matching is an example of a low-level vision task. It is defined as the automatic identification of corresponding image points in a given pair of images of the same scene taken from different viewpoints, or of a moving scene captured by a fixed camera. Identifying image correspondences is an important problem in computer vision for geometry and motion recovery.
Another fundamental low-level vision task is optical flow computation and motion analysis. Optical flow is the pattern of the apparent motion of objects, surfaces, and edges in a visual scene caused by the movement of an object or camera. Optical flow is a 2D vector field where each vector corresponds to a displacement vector showing the movement of points from one frame to the next. Most existing methods which estimate camera motion or object motion use optical flow information.
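To make the notion of a 2D vector field concrete, the following minimal sketch stores one displacement vector (u, v) per pixel and uses it to move the pixels of one frame to their positions in the next frame. The function name `warp_forward`, the 5 x 5 toy frame, and the integer displacements are assumptions made for illustration only. The sketch applies a known flow field and does not estimate one from images.

```python
def warp_forward(frame, flow):
    """Move every nonzero pixel of `frame` along its flow vector.

    frame: list of rows of pixel intensities.
    flow:  list of rows of (u, v) tuples, where u is the horizontal and
           v the vertical displacement (in pixels) to the next frame.
    """
    height, width = len(frame), len(frame[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if frame[y][x] == 0:
                continue  # background pixel, nothing to move
            u, v = flow[y][x]
            new_x, new_y = x + u, y + v
            # Pixels displaced outside the image are dropped.
            if 0 <= new_x < width and 0 <= new_y < height:
                warped[new_y][new_x] = frame[y][x]
    return warped


# Toy frame with a single bright pixel at row 1, column 1.
frame_t = [[0] * 5 for _ in range(5)]
frame_t[1][1] = 9

# Flow field: zero everywhere except at the bright pixel, which moves
# two pixels to the right and one pixel down.
flow = [[(0, 0)] * 5 for _ in range(5)]
flow[1][1] = (2, 1)

frame_next = warp_forward(frame_t, flow)
```

In this toy example the bright pixel ends up at row 2, column 3 of `frame_next`. Practical methods work in the opposite direction: they observe two frames and estimate the flow field that best explains the change between them.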
Mid-level Vision
Mid-level vision provides a higher level of abstraction than low-level vision. For instance, inferring the geometry of objects is one of the major aspects of mid-level vision. Geometric vision includes multi-view geometry, stereo, and structure from motion (SfM), which infer the 3D scene information from 2D images so that 3D reconstruction becomes possible. Another task of mid-level vision is visual motion capturing and tracking, which estimates 2D and 3D motions, including deformable and articulated motions. In order to answer the question “How does the object move?” image segmentation is required to find the areas in the images which belong to the object.
High-level Vision
Based on an adequately segmented representation of the 2D and/or 3D structure of the image, extracted using lower-level processing (e.g., image processing, low-level vision, and mid-level vision), high-level vision completes the task of delivering a coherent interpretation of the image. High-level vision determines what objects are present in the scene and interprets their interrelations. For example, object recognition and scene understanding are two high-level vision tasks which infer the semantics of objects and scenes, respectively. Achieving robust recognition, e.g., recognizing objects from different viewpoints, is still a challenging problem.
Other examples of high-level vision are image understanding and video understanding. Based on information provided by object recognition, image and video understanding try to answer questions such as “Is there a tiger in the image?”, “Is this video a drama or an action movie?”, or “Is there any suspicious activity in a surveillance video?” Developing such high-level vision tasks helps to fulfill different higher-level tasks in intelligent human-computer interaction, intelligent robots, smart environments, and content-based multimedia.
1.2 WHAT IS MACHINE LEARNING?
Computer vision algorithms have seen rapid progress in recent years. In particular, combining computer vision with machine learning contributes to the development of flexible and robust computer vision algorithms and thus improves the performance of practical vision systems. For instance, Facebook has combined computer vision, machine learning, and its large corpus of photos to achieve a robust and highly accurate facial recognition system. That is how Facebook can suggest whom to tag in your photo. In the following, we first define machine learning and then describe the importance of machine learning for computer vision tasks.
Machine learning is a type of artificial intelligence (AI) which allows computers to learn from data without being explicitly programmed. In other words, the goal of machine learning is to design methods that automatically perform learning using observations of the real world (called the “training data”), without explicit definition of rules or logic by humans (the “trainer”/“supervisor”). In that sense, machine learning can be considered as programming by data samples. In summary, machine learning is about learning to do better in the future based on what was experienced in the past.
A diverse set of machine learning algorithms has been proposed to cover the wide variety of data and problem types. These learning methods can be divided into three main approaches, namely supervised, semi-supervised, and unsupervised. However, the majority of practical machine learning methods are currently supervised learning methods, because of their superior performance compared to their counterparts. In supervised learning methods, the training data takes the form of a collection of (data:x, label:y) pairs and the goal is to produce a prediction y* in response to a query sample x. The input x can be a feature vector, or more complex data such as images, documents, or graphs. Similarly, different types of output y have been studied. The output y can be a binary label which is used in a simple binary classification problem (e.g., “yes” or “no”). However, there have also been numerous research works on problems such as multi-class classification, where y is labeled by exactly one of K labels, multi-label classification, where y can take on several of the K labels simultaneously, and general structured prediction problems, where y is a high-dimensional output constructed from a sequence of predictions (e.g., semantic segmentation).
Supervised learning methods approximate a mapping function f(x) which can predict the output variables y for a given input sample x. Different forms of mapping function f(.) exist (some are briefly covered in Chapter 2), including decision trees, Random Decision Forests (RDF), logistic regression (LR), Support Vector Machines (SVM), Neural Networks (NN), kernel machines, and Bayesian classifiers. A wide range of learning algorithms has also been proposed to estimate these different types of mappings.
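As an illustration of learning a mapping f(x) from (x, y) pairs, the following minimal sketch fits a logistic regression model to a toy binary classification problem using batch gradient descent. The function names, the learning rate, the number of epochs, and the six 2D training samples are all assumptions chosen for this example. It is a sketch of one of the mapping functions listed above, not a reference implementation.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(samples, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n_samples, n_features = len(samples), len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y  # gradient of the loss w.r.t. the score
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted binary label y* for a query sample x."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy training data: (data: x, label: y) pairs with 2D feature vectors.
train_x = [[0.0, 0.2], [0.3, 0.1], [0.2, 0.4],
           [1.0, 0.9], [0.8, 1.1], [1.2, 0.7]]
train_y = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(train_x, train_y)
y_star = predict(weights, bias, [0.9, 1.0])  # query sample close to class 1
```

The learned parameters `weights` and `bias` define the mapping f(x). The training loop is the learning algorithm that estimates them from the labeled pairs.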
On the other hand, unsupervised learning is where one has only input data X and no corresponding output variables. It is called unsupervised learning because (unlike supervised learning) there are no ground-truth outputs and there is no teacher. The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to discover interesting patterns in it. The most common unsupervised learning approach is clustering, with methods such as hierarchical clustering, k-means clustering, Gaussian Mixture Models (GMMs), Self-Organizing Maps (SOMs), and Hidden Markov Models (HMMs).
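As an example of discovering structure without labels, the following minimal sketch implements k-means clustering, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. The function names, the toy 2D points, and the choice of passing the initial centroids explicitly (to keep the run deterministic) are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Cluster `points` around len(initial_centroids) centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the algorithm converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids, assignments


# Unlabeled toy data forming two well-separated groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
          [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]

centroids, assignments = k_means(points, [points[0], points[3]])
```

No labels are used at any point. The grouping emerges purely from the distances between the input samples.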
Semi-supervised learning methods sit in-between supervised and unsupervised learning. These learning methods are used when a large amount of input data is available and only some of the data is labeled. A good example is a photo archive where only some of the images are labeled (e.g., dog, cat, person), and the majority are unlabeled.
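One simple way to exploit unlabeled data is self-training, sketched below with a nearest-centroid classifier. This particular technique is an assumption chosen for illustration and is only one of many semi-supervised approaches. A model fitted on the few labeled samples assigns pseudo-labels to the unlabeled samples, and the model is then refitted on both. The 2D feature vectors stand in for hypothetical image descriptors, and all function names are illustrative.

```python
def fit_centroids(pairs):
    """Compute one mean feature vector (centroid) per label."""
    by_label = {}
    for x, y in pairs:
        by_label.setdefault(y, []).append(x)
    return {y: [sum(dim) / len(xs) for dim in zip(*xs)]
            for y, xs in by_label.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def self_train(labeled, unlabeled, rounds=3):
    """Refine the centroids using pseudo-labels for the unlabeled samples."""
    centroids = fit_centroids(labeled)
    for _ in range(rounds):
        # Pseudo-label every unlabeled sample with the current model.
        pseudo_labeled = [(x, predict(centroids, x)) for x in unlabeled]
        # Refit on the true labels plus the pseudo-labels.
        centroids = fit_centroids(labeled + pseudo_labeled)
    return centroids


# A small archive: two labeled samples and five unlabeled ones.
labeled = [([0.0, 0.0], "cat"), ([4.0, 4.0], "dog")]
unlabeled = [[0.5, 0.2], [0.1, 0.6], [3.8, 4.4], [4.5, 3.9], [3.6, 3.7]]

centroids = self_train(labeled, unlabeled)
label = predict(centroids, [3.0, 3.2])
```

Here the unlabeled samples shift each centroid toward the true center of its group, even though only one labeled sample per class was available. Self-training can also reinforce early mistakes, so practical systems usually keep only confident pseudo-labels.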
1.2.1 WHY DEEP LEARNING?
While these machine learning algorithms have been around for a long time, the ability to automatically apply complex mathematical computations to large-scale data is a recent development. This is because the increased power of today’s computers, in terms of speed and memory, has helped machine learning techniques evolve to learn from a large corpus of training data. For example, with more computing power and a large enough memory, one can create neural networks of many layers, which are called deep neural networks.