inherits major advantages of leaky ReLU, and is small at the system origin, which means a smoother/more robust performance with respect to noise than leaky ReLU. However, the computational overhead for ELU is greater than that for leaky ReLU due to the exponential factor in ELU.
3.1.4 Discrete convolution and weights
It is well known that convolution is a linear operation, which is of great importance in mathematics. A discrete convolution is a weighted summation of components of a vector/matrix/tensor. In the signal processing field, convolution is used to recognize a local pattern in an image by extracting local features and integrating them properly. Images often contain local correlations, and the purpose of convolution is to find a local linear correlation. It will become clear below that a multi-layer convolutional network performs a powerful multi-resolution analysis, consistent with the inner workings of the HVS. The three most common types of convolution operations for signal processing are full convolution, same convolution, and valid convolution. Without loss of generality, in the 1D case let us assume that the input signal $x \in \mathbb{R}^{n}$ is a one-dimensional vector and the filter $w \in \mathbb{R}^{m}$ is another one-dimensional vector. The convolution can then be categorized as follows:
1. Full convolution
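$$
\begin{aligned}
y &= \operatorname{conv}(x, w, \text{“full”}) = \bigl(y(1), \ldots, y(t), \ldots, y(n+m-1)\bigr) \in \mathbb{R}^{n+m-1}, \\
y(t) &= \sum_{i=1}^{m} x(t-i+1) \cdot w(i), \quad t = 1, 2, \ldots, n+m-1,
\end{aligned}
\tag{3.8}
$$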
where zero padding is applied as needed.
2. Same convolution
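$$
y = \operatorname{conv}(x, w, \text{“same”}) = \operatorname{center}\bigl(\operatorname{conv}(x, w, \text{“full”}),\, n\bigr) \in \mathbb{R}^{n}. \tag{3.9}
$$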
The result of the same convolution is the central part of the full convolution, which is of the same size as the input vector x.
3. Valid convolution
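$$
\begin{aligned}
y &= \operatorname{conv}(x, w, \text{“valid”}) = \bigl(y(1), \ldots, y(t), \ldots, y(n-m+1)\bigr) \in \mathbb{R}^{n-m+1}, \\
y(t) &= \sum_{i=1}^{m} x(t+i-1) \cdot w(i), \quad t = 1, 2, \ldots, n-m+1,
\end{aligned}
\tag{3.10}
$$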
where n>m. In contrast to the full and same convolutions, no zero padding is involved in the valid convolution.
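The three 1D operations can be sketched directly from equations (3.8)-(3.10). This is a minimal illustration, not a reference implementation: the function names and the test signal are hypothetical, and the centering rule used for the same convolution (start at offset ⌊(m − 1)/2⌋ of the full result) is an assumption, since the text only says "the central part". Note that equation (3.10) as written does not flip the filter, so the sketch compares it with NumPy's `np.correlate` rather than `np.convolve`.

```python
import numpy as np

def conv_full(x, w):
    """Full convolution, equation (3.8): output length n + m - 1."""
    n, m = len(x), len(w)
    y = np.zeros(n + m - 1)
    for t in range(1, n + m):          # t = 1, ..., n + m - 1 (1-based, as in the text)
        for i in range(1, m + 1):      # i = 1, ..., m
            k = t - i + 1              # 1-based index into x
            if 1 <= k <= n:            # outside this range x is treated as zero (padding)
                y[t - 1] += x[k - 1] * w[i - 1]
    return y

def conv_same(x, w):
    """Same convolution, equation (3.9): central n samples of the full result."""
    n, m = len(x), len(w)
    start = (m - 1) // 2               # assumed centering convention
    return conv_full(x, w)[start:start + n]

def conv_valid(x, w):
    """Valid operation, equation (3.10) as written: no padding, filter not flipped."""
    n, m = len(x), len(w)
    y = np.zeros(n - m + 1)
    for t in range(1, n - m + 2):      # t = 1, ..., n - m + 1
        for i in range(1, m + 1):
            y[t - 1] += x[t + i - 2] * w[i - 1]   # x(t + i - 1) in 1-based indexing
    return y

# Hypothetical example signal and filter (n = 5, m = 3).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])

y_full = conv_full(x, w)     # length 7
y_same = conv_same(x, w)     # length 5
y_valid = conv_valid(x, w)   # length 3
```

For this odd-length filter, `y_full` and `y_same` agree with `np.convolve(x, w, "full")` and `np.convolve(x, w, "same")`, while `y_valid` agrees with `np.correlate(x, w, "valid")`.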
The ideas behind the one-dimensional convolutions can be extended to the 2D case. Assume that the two-dimensional input image is $X \in \mathbb{R}^{n \times m}$ and the two-dimensional filter is $W \in \mathbb{R}^{s \times k}$. Then, the discrete two-dimensional convolution operation can be represented as follows:
$$
Y(p,t) = (X * W)(p,t) = \sum_{i} \sum_{j} X(i,j) \cdot W(p-i,\, t-j), \tag{3.11}
$$
where * represents convolution and · represents multiplication. Likewise, the convolution operations (full, same, and valid) can be defined in higher dimensional cases.
In contrast to the convolution formulas given above, cross-correlation functions can be defined in nearly the same way as the convolution functions:
$$
Y(p,t) = (X * W)(p,t) = \sum_{i} \sum_{j} X(p+i,\, t+j) \cdot W(i,j). \tag{3.12}
$$
The difference between cross-correlation and convolution is whether the filter W is flipped or not. In the machine learning field, the exact convolution is rarely used; instead, an image is usually processed with a cross-correlation operation, that is, without flipping the filter W. The operation without flipping W is nevertheless still called convolution (rigorously, it is cross-correlation).
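The role of the flip can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it uses the valid mode (no padding) and stride 1, and the function names, the 4 × 4 test image, and the 2 × 2 filter are all hypothetical. It computes the cross-correlation of equation (3.12) with explicit loops and obtains the convolution by flipping the filter along both axes first.

```python
import numpy as np

def cross_correlate2d(X, W):
    """Valid 2D cross-correlation, equation (3.12): the filter is not flipped."""
    n, m = X.shape
    s, k = W.shape
    Y = np.zeros((n - s + 1, m - k + 1))
    for p in range(Y.shape[0]):
        for t in range(Y.shape[1]):
            # Weighted sum over the s x k window whose top-left corner is (p, t).
            Y[p, t] = np.sum(X[p:p + s, t:t + k] * W)
    return Y

def convolve2d(X, W):
    """Valid 2D convolution: cross-correlation with the filter flipped along both axes."""
    return cross_correlate2d(X, W[::-1, ::-1])

# Hypothetical test data.
X = np.arange(16, dtype=float).reshape(4, 4)
W = np.array([[1.0, 0.0],
              [0.0, -1.0]])

Y_corr = cross_correlate2d(X, W)   # every entry equals X[p, t] - X[p+1, t+1] = -5
Y_conv = convolve2d(X, W)          # every entry equals X[p+1, t+1] - X[p, t] = +5

# For a filter that is symmetric under the flip, the two operations coincide.
W_sym = np.ones((2, 2))
same_result = np.allclose(cross_correlate2d(X, W_sym), convolve2d(X, W_sym))
```

Because the weights of a network are learned, flipping the filter only relabels the learned entries, which is one way to see why the two operations are used interchangeably in practice.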
Figure 3.10 illustrates an example of a convolution operation (without flipping) on a 2D image.
Figure 3.10. Example of a 2D convolution operation (weight without flipping) on a 2D input image.
In a neural network, a convolution operation is specified with two accessory parameters, namely, stride and zero padding. Stride refers to the step increment with which the filter window jumps from its current position to the next position. For example, in figure 3.10 the initial position of the window is at the first pixel, and the second position is at the second pixel, thus stride = 2 − 1 = 1. Zero padding refers to the number of zeros appended to the original data on each side along a given dimension. Generally speaking, when a valid convolution operation is combined with stride and zero padding, the output size is calculated as follows (without loss of generality, in the 2D case):
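$$
\begin{aligned}
Y &= X * W \in \mathbb{R}^{u \times v}, \\
u &= \left\lfloor \frac{n - s + 2 \cdot \text{zero padding}}{\text{stride}} \right\rfloor + 1, \\
v &= \left\lfloor \frac{m - k + 2 \cdot \text{zero padding}}{\text{stride}} \right\rfloor + 1,
\end{aligned}
\tag{3.13}
$$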
where ‘⌊⌋’ represents rounding down (the floor function).
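As a quick check of equation (3.13), the output size can be computed with integer floor division. This is a minimal sketch: the function name and the example sizes are hypothetical, the same padding and stride are assumed in both directions, and the padded input is assumed to be at least as large as the filter.

```python
def conv_output_size(n, m, s, k, zero_padding=0, stride=1):
    """Output size (u, v) of equation (3.13) for an n x m input and an s x k filter."""
    # Integer floor division implements the downward rounding.
    u = (n - s + 2 * zero_padding) // stride + 1
    v = (m - k + 2 * zero_padding) // stride + 1
    return u, v

# Hypothetical examples.
size_plain = conv_output_size(5, 5, 3, 3)                             # (3, 3)
size_padded = conv_output_size(5, 5, 3, 3, zero_padding=1)            # (5, 5)
size_strided = conv_output_size(7, 7, 3, 3, zero_padding=1, stride=2) # (4, 4)
```

With a 3 × 3 filter, stride 1, and zero padding 1, the output keeps the input size, which corresponds to the same convolution described above.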
In the early neural networks, the connection between layers is in a fully connected form; that is, each neuron is connected to all neurons in the previous layer, needing a large number of parameters. Improving upon the fully connected network, convolutional neural networks rely on convolutions, greatly reducing the number of parameters. The core of the convolution operation is that it reduces unnecessary weighting links, only keeps local connections, and shares weights across the field of view. Since the convolution operation is shift-invariant, the learned features tend to be robust without overfitting.
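The parameter saving can be made concrete with a small count. The sizes below are hypothetical and chosen only for illustration: a single-channel 32 × 32 input mapped to a 32 × 32 output, with biases ignored.

```python
height, width = 32, 32      # hypothetical single-channel image size
kernel_size = 3             # hypothetical filter size

# Fully connected: every output pixel has its own weight for every input pixel.
fc_weights = (height * width) * (height * width)

# Convolution: one kernel_size x kernel_size filter is shared by all positions.
conv_weights = kernel_size * kernel_size
```

Under these assumptions the fully connected layer needs 1048576 weights, whereas the convolutional layer needs 9, regardless of the image size.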
In fact, convolution is a feature extraction operation for a given set of weights, such as the redundancy-removed ZCA and ICA features presented in the previous chapters. This is not limited to the low-level feature space; higher-level features can also be obtained in this way to represent the image information semantically.
A special convolution: 1 × 1 convolution
Now, let us introduce a special convolution kernel of size 1 × 1. As mentioned above, convolution is a local weighted summation. In the 1 × 1 convolution, the local receptive field is 1 × 1. Therefore, a 1 × 1 convolution is equivalent to a linear combination of feature maps. In the case of multiple channels and multiple convolution kernels, a 1 × 1 convolution mainly has two effects:
1. A 1 × 1 convolution can lead to dimension reduction. If a 1 × 1 convolution is applied after the pooling layer, its effect is also dimension reduction. Moreover, it can reduce the redundancy in the feature maps obtained after the processing in each layer of the network. In reference to Olshausen and Field’s work (Olshausen and Field 1996), the learnt sparse features can be considered as a linear combination of ZCA features, which is an example of feature sparsity.
2. Under the premise of keeping the feature scale unchanged (i.e. without loss of resolution), the activation function applied after a 1 × 1 convolution can greatly increase the nonlinearity of the network, which helps to deepen the network.
Figure 3.11 helps illustrate the effects of the 1 × 1 convolution.
Figure 3.11. Example of a 1 × 1 convolution operation.
In figure 3.11, the number of input feature maps is 2, and their sizes are all 5 × 5. After three 1 × 1 convolution operations, the number of the output feature maps is 3, and their sizes are 5 × 5. It is seen that the 1 × 1 convolutions realize the linear combinations of multiple feature maps while keeping the feature map size intact, realizing cross-channel interaction and information integration.
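The configuration described for figure 3.11 can be sketched in a few lines. This is a minimal illustration with random, hypothetical values: two 5 × 5 input feature maps are combined by three 1 × 1 kernels, each of which holds one weight per input channel. Biases and the subsequent activation function are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((2, 5, 5))   # 2 input feature maps of size 5 x 5
kernels = rng.standard_normal((3, 2))           # three 1 x 1 kernels, one weight per input channel

# Each output map is a weighted sum of the input maps, computed pixel by pixel:
# output[o, h, w] = sum over c of kernels[o, c] * feature_maps[c, h, w].
output = np.einsum("oc,chw->ohw", kernels, feature_maps)

# The first output map written out explicitly as a linear combination.
first_map = kernels[0, 0] * feature_maps[0] + kernels[0, 1] * feature_maps[1]
```

The output has shape (3, 5, 5): the number of channels changes from 2 to 3 while the spatial size stays 5 × 5.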
Furthermore, in figure 3.12, we combine the 3 × 3 convolution with the 1 × 1 convolution. Assuming that there are 128 input feature maps of size w × h, the computational complexity on the left is w × h × 128 × 3 × 3 × 128 = 147456 × w × h, and that on the right is w × h × 128 × 1 × 1 × 32 + w × h × 32 × 3 × 3 × 32 + w × h × 32 × 1 × 1 × 128 = 17408 × w × h. The number of parameters on the left is approximately 8.5 times that on the right. Therefore, the use of 1 × 1 convolution causes dimension reduction and reduces the number of parameters.
Figure 3.12. Original 3 × 3 convolution, and improved 3 × 3 convolution combined with two 1 × 1 convolutions.
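The arithmetic behind this comparison can be reproduced with a short script. The helper name is hypothetical, biases are ignored, and the count is given per output pixel, so the common factor w × h is left out.

```python
def conv_cost(c_in, c_out, kernel):
    """Multiplications per output pixel (equivalently, weights) of a kernel x kernel convolution."""
    return c_in * kernel * kernel * c_out

# Left of figure 3.12: a single 3 x 3 convolution, 128 -> 128 channels.
direct = conv_cost(128, 128, 3)

# Right of figure 3.12: 1 x 1 (128 -> 32), 3 x 3 (32 -> 32), then 1 x 1 (32 -> 128).
bottleneck = conv_cost(128, 32, 1) + conv_cost(32, 32, 3) + conv_cost(32, 128, 1)

ratio = direct / bottleneck
```

This gives `direct = 147456`, `bottleneck = 17408`, and a ratio of about 8.47, in line with the factor of approximately 8.5 quoted in the text.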
In addition, after 1 × 1 convolution a new activation function for nonlinear transformation takes effect. Therefore, 1 × 1 convolution makes