Salman Khan

A Guide to Convolutional Neural Networks for Computer Vision



with the closest scale.

      2. The gradient magnitude and orientation are computed for each image pixel at this scale.

      3. As shown in Fig. 2.5, an orientation histogram, which consists of 36 bins covering the 360° range of orientations, is built from the gradient orientations of pixels within a local region around the keypoint.

      4. The highest peak in the local orientation histogram corresponds to the dominant direction of the local gradients. Moreover, any other local peak that is within 80% of the highest peak is also used to create a keypoint with that orientation.
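The histogram and peak-detection steps above can be sketched in a few lines of NumPy. The function names and the simple nearest-bin voting are illustrative assumptions; the full SIFT pipeline additionally smooths the histogram and interpolates the peak positions:

```python
import numpy as np

def orientation_histogram(magnitudes, orientations, num_bins=36):
    """Accumulate gradient magnitudes into orientation bins covering 360 degrees."""
    hist = np.zeros(num_bins)
    bin_width = 360.0 / num_bins
    for mag, theta in zip(magnitudes, orientations):
        hist[int(theta % 360 // bin_width)] += mag
    return hist

def dominant_orientations(hist, peak_ratio=0.8):
    """Return the bin centers of the highest peak and of any local peak
    within peak_ratio (80% in SIFT) of it. The histogram is circular,
    so neighbor lookups wrap around."""
    n = len(hist)
    bin_width = 360.0 / n
    threshold = peak_ratio * hist.max()
    peaks = []
    for i in range(n):
        left, right = hist[(i - 1) % n], hist[(i + 1) % n]
        if hist[i] > left and hist[i] > right and hist[i] >= threshold:
            peaks.append((i + 0.5) * bin_width)
    return peaks
```

For example, a patch dominated by gradients near 10° with a secondary cluster near 200° of at least 80% of the first peak's mass would yield two keypoint orientations.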

       Keypoint Descriptor

      The dominant direction (the highest peak in the histogram) of the local gradients is also used to create keypoint descriptors. The gradient orientations are rotated relative to the orientation of the keypoint and then weighted by a Gaussian with a standard deviation of 1.5 times the scale of the keypoint. Then, a 16 × 16 neighborhood around the keypoint is divided into 16 sub-blocks of size 4 × 4. For each sub-block, an 8-bin orientation histogram is created. Concatenating these histograms results in a feature vector, called the SIFT descriptor, containing 16 × 8 = 128 elements. Figure 2.6 illustrates the SIFT descriptors for keypoints extracted from an example image.
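The descriptor layout just described (16 × 16 patch, 4 × 4 sub-blocks, 8 bins each, 128 values in total) can be sketched as follows. This is a simplified illustration that omits the Gaussian weighting, trilinear interpolation, and contrast clipping of the full SIFT descriptor:

```python
import numpy as np

def sift_descriptor(patch_mag, patch_ori):
    """Build a 128-D SIFT-style descriptor from a 16x16 patch of gradient
    magnitudes and orientations (in degrees, already rotated relative to
    the keypoint orientation)."""
    assert patch_mag.shape == (16, 16) and patch_ori.shape == (16, 16)
    descriptor = []
    for by in range(4):              # 4x4 grid of sub-blocks
        for bx in range(4):
            hist = np.zeros(8)       # 8 orientation bins of 45 degrees each
            sub_m = patch_mag[4*by:4*by+4, 4*bx:4*bx+4]
            sub_o = patch_ori[4*by:4*by+4, 4*bx:4*bx+4]
            for m, o in zip(sub_m.ravel(), sub_o.ravel()):
                hist[int(o % 360 // 45)] += m
            descriptor.extend(hist)
    d = np.array(descriptor)         # 16 sub-blocks x 8 bins = 128 elements
    return d / (np.linalg.norm(d) + 1e-12)  # unit norm for illumination invariance
```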

       Complexity of SIFT Descriptor

      In summary, SIFT tries to standardize all images with respect to scale (if the image is blown up, SIFT shrinks it; if the image is shrunk, SIFT enlarges it). This corresponds to the idea that if a keypoint can be detected at scale σ in an image, then a larger scale kσ would be needed to capture the same keypoint if the image were up-scaled. However, the mathematical ideas behind SIFT and many other hand-engineered features are quite complex and required many years of research. For example, Lowe [2004] spent almost 10 years on the design and tuning of the SIFT parameters. As we will show in Chapters 4, 5, and 6, CNNs also perform a series of transformations on the image by incorporating several convolutional layers. However, unlike SIFT, CNNs learn these transformations (e.g., scale, rotation, translation) from image data, without the need for complex hand-crafted mathematical models.

      Figure 2.5: A dominant orientation estimate is computed by creating a histogram of all the gradient orientations weighted by their magnitudes and then finding the significant peaks in this distribution.

      Figure 2.6: An example of the SIFT detector and descriptor: (left) an input image, (middle) some of the detected keypoints with their corresponding scales and orientations, and (right) SIFT descriptors: a 16 × 16 neighborhood around each keypoint is divided into 16 sub-blocks of 4 × 4 size.

      SURF [Bay et al., 2008] is a speeded-up version of SIFT. In SIFT, the Laplacian of Gaussian (LoG) is approximated with the Difference of Gaussians (DoG) to construct a scale-space. SURF speeds up this process further by approximating the LoG with a box filter. A convolution with a box filter can easily be computed with the help of integral images and can be performed in parallel for different scales.
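The integral-image trick that makes box filtering cheap is simple to sketch: once the summed-area table is built, the sum over any rectangle costs four lookups, regardless of the filter size. A minimal NumPy sketch (the function names are illustrative):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x]."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) using four lookups into the table."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
```

Because `box_sum` is constant-time, evaluating a larger box filter (a larger scale) costs no more than a small one, which is exactly why SURF can scan scale-space by growing the filter instead of shrinking the image.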

       Keypoint Localization

      In the first step, a blob detector based on the Hessian matrix is used to localize keypoints. The determinant of the Hessian matrix is used to select both the location and scale of the potential keypoints. More precisely, for an image I with a given point p = (x, y), the Hessian matrix H(p, σ) at point p and scale σ, is defined as follows:

                     | Lxx(p, σ)   Lxy(p, σ) |
          H(p, σ) =  |                       |
                     | Lxy(p, σ)   Lyy(p, σ) |

      where Lxx(p, σ) is the convolution of the second-order Gaussian derivative ∂²g(σ)/∂x² with the image I at point p, and similarly for Lxy(p, σ) and Lyy(p, σ). However, instead of using Gaussian filters, SURF uses approximated Gaussian second-order derivatives, which can be evaluated using integral images at a very low computational cost. Thus, unlike SIFT, SURF does not need to iteratively apply the same filter to the output of a previously filtered layer; scale-space analysis is done by keeping the image fixed and varying the filter size, i.e., 9 × 9, 15 × 15, 21 × 21, and 27 × 27.
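The blob response at each location and scale is the approximated determinant of this Hessian. A minimal sketch, where Dxx, Dyy, and Dxy stand for the box-filter responses and the weight w ≈ 0.9 compensates for the box approximation of the Gaussian derivatives (as proposed by Bay et al. [2008]):

```python
def hessian_response(Dxx, Dyy, Dxy, w=0.9):
    """SURF blob response: approximated determinant of the Hessian.
    Dxx, Dyy, Dxy are box-filter responses at one scale (scalars or
    NumPy arrays); w ~ 0.9 balances the box-filter approximation."""
    return Dxx * Dyy - (w * Dxy) ** 2
```

Candidate keypoints are the locations where this response is large, which feeds directly into the non-maximum suppression described next.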

      Then, a non-maximum suppression in a 3 × 3 × 3 neighborhood of each point in the image is applied to localize the keypoints in the image. The maxima of the determinant of the Hessian matrix are then interpolated in the scale and image space, using the method proposed by Brown and Lowe [2002].
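A brute-force version of this 3 × 3 × 3 non-maximum suppression over a (scale, y, x) response stack can be sketched as below; the threshold parameter and the strict-maximum test are illustrative choices, and the interpolation step of Brown and Lowe [2002] is omitted:

```python
import numpy as np

def nms_3x3x3(response, threshold=0.0):
    """Keep points that are strict maxima of their 3x3x3 neighborhood
    in a (scale, height, width) response stack. Returns (s, y, x) tuples."""
    S, H, W = response.shape
    keypoints = []
    for s in range(1, S - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                v = response[s, y, x]
                if v <= threshold:
                    continue
                nbhd = response[s-1:s+2, y-1:y+2, x-1:x+2]
                # strict maximum: v equals the max and occurs exactly once
                if v >= nbhd.max() and (nbhd == v).sum() == 1:
                    keypoints.append((s, y, x))
    return keypoints
```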

       Orientation Assignment

      In order to achieve rotational invariance, the Haar wavelet responses in both the horizontal x and vertical y directions are computed within a circular neighborhood of radius 6s around the keypoint, where s is the scale at which the keypoint was detected. The horizontal dx and vertical dy responses are then weighted with a Gaussian centered at the keypoint, and represented as points in a 2D space. The dominant orientation of the keypoint is estimated by sliding an orientation window of angle 60° and summing all the responses that fall within it. The summed horizontal and vertical responses within the window form a local orientation vector. The longest such vector over all the windows determines the orientation of the keypoint. In order to achieve a balance between robustness and angular resolution, the size of the sliding window needs to be chosen carefully.
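The sliding-window estimate can be sketched as follows. The 60° window follows the description above; the step size between window positions is an illustrative implementation choice:

```python
import numpy as np

def dominant_orientation(dx, dy, angles, window=np.pi / 3, step=np.pi / 36):
    """Slide a 60-degree window over the response angles; the window whose
    summed (dx, dy) vector is longest gives the keypoint orientation.
    dx, dy: Haar responses; angles: their orientations in radians."""
    best_len, best_angle = -1.0, 0.0
    for start in np.arange(0.0, 2 * np.pi, step):
        diff = (angles - start) % (2 * np.pi)
        mask = diff < window                      # responses inside the window
        sx, sy = dx[mask].sum(), dy[mask].sum()   # local orientation vector
        length = np.hypot(sx, sy)
        if length > best_len:
            best_len, best_angle = length, np.arctan2(sy, sx)
    return best_angle
```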

       Keypoint Descriptor

      To describe the region around each keypoint p, a 20s × 20s square region around p is extracted and then oriented along the orientation of p. The normalized orientation region around p is split into smaller 4 × 4 square sub-regions. The Haar wavelet responses in both the horizontal dx and vertical dy directions are extracted at 5 × 5 regularly spaced sample points for each sub-region. In order to achieve more robustness to deformations, noise, and translation, the Haar wavelet responses are weighted with a Gaussian. Then, dx and dy are summed up over each sub-region and the results form the first set of entries in the feature vector. The sums of the absolute values of the responses, |dx| and |dy|, are also computed and added to the feature vector to encode information about the intensity changes. Since each sub-region contributes a 4D feature vector, concatenating all 4 × 4 sub-regions results in a 64D descriptor.
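Assembling the 64D descriptor from precomputed Haar responses is mostly bookkeeping. The sketch below assumes the oriented 20s × 20s region has already been resampled to a 20 × 20 grid of dx, dy responses (5 × 5 samples for each of the 4 × 4 sub-regions), and adds the usual normalization to unit length:

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Assemble the 64-D SURF descriptor from Haar wavelet responses.
    dx, dy: (20, 20) arrays of horizontal/vertical responses sampled on
    the oriented region (5x5 samples per 4x4 sub-region)."""
    assert dx.shape == (20, 20) and dy.shape == (20, 20)
    desc = []
    for by in range(4):
        for bx in range(4):
            sx = dx[5*by:5*by+5, 5*bx:5*bx+5]
            sy = dy[5*by:5*by+5, 5*bx:5*bx+5]
            # per sub-region: (sum dx, sum dy, sum |dx|, sum |dy|)
            desc.extend([sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()])
    d = np.array(desc)                        # 16 sub-regions x 4 = 64 elements
    return d / (np.linalg.norm(d) + 1e-12)   # unit norm for contrast invariance
```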

      Until recently, progress in computer vision was based on hand-engineered features. However, feature engineering is difficult, time-consuming, and requires expert knowledge of the problem domain. Another issue with hand-engineered features such as HOG, SIFT, and SURF is that the information they capture from an image is too sparse, because first-order image derivatives are not sufficient features for most computer vision tasks, such as image classification and object detection. Moreover, the choice of features often depends on the application. More precisely, these features do not facilitate learning from previously learned representations (transfer learning). In addition, the design of hand-engineered features is limited by the complexity that humans can put into them. All these issues are resolved by automatic feature learning algorithms such as deep neural networks, which will be addressed in the subsequent chapters (i.e., Chapters 3, 4,