Professor Ge Wang

Machine Learning for Tomographic Imaging



the gradient descent method, the adjustment of $w_{ij}$ can be calculated according to the partial derivatives expressed in equation (3.38). In the (n + 1)th iteration, the update of $w_{ij}$ can be computed as follows:

      $$w_{ij}^{n+1} = w_{ij}^{n} + \Delta w_{ij}^{n}, \tag{3.39}$$

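      The partial derivative $\frac{\partial L}{\partial w_{ij}}$ reflects how much L changes when $w_{ij}$ is adjusted. If $\frac{\partial L}{\partial w_{ij}} > 0$, an increase in $w_{ij}$ leads to an increased L value. On the other hand, if $\frac{\partial L}{\partial w_{ij}} < 0$, an increase in $w_{ij}$ leads to a decreased L value. To reduce L, $w_{ij}$ should therefore be moved against the sign of the derivative, and a coefficient η, the learning rate, is needed to specify the updating speed: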

      $$\Delta w_{ij} = -\eta \frac{\partial L}{\partial w_{ij}} = -\eta \delta_j o_i. \tag{3.40}$$

      One should choose a learning rate η > 0. Multiplying the gradient by a suitably small learning rate and by −1 ensures that $w_{ij}$ is refined in a way that decreases the loss function. In other words, changing $w_{ij}$ by $-\eta \frac{\partial L}{\partial w_{ij}}$ reduces the L value. For more details on the backpropagation algorithm, see Rumelhart et al (1986).
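      The update rule in equations (3.39) and (3.40) can be illustrated with a minimal sketch. The setup is an assumption made only for illustration: a single linear neuron with output w·o_i, a squared-error loss, and hypothetical values for the input, target, and learning rate. For this toy case the error signal δ_j is simply the difference between the output and the target.

```python
def loss(w, o_i, target):
    """Squared-error loss of a single linear neuron with output w * o_i."""
    return 0.5 * (w * o_i - target) ** 2


def gradient_step(w, o_i, target, eta):
    """One weight update following equations (3.39) and (3.40)."""
    delta_j = w * o_i - target   # error signal of the receiving neuron (toy case)
    grad = delta_j * o_i         # dL/dw = delta_j * o_i
    delta_w = -eta * grad        # equation (3.40)
    return w + delta_w           # equation (3.39)


# Hypothetical values chosen for illustration only.
w, o_i, target, eta = 0.0, 2.0, 1.0, 0.1
losses = []
for _ in range(5):
    losses.append(loss(w, o_i, target))
    w = gradient_step(w, o_i, target, eta)

print(losses)  # the loss shrinks at every iteration
print(w)       # approaches the minimizer 0.5
```

      In this toy problem each step multiplies the distance to the minimizer by (1 − η·o_i²) = 0.6, so the loss decreases monotonically. With a learning rate above 0.5 the same factor exceeds 1 in magnitude and the loss grows, which is why the learning rate must be chosen with care.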

      The convolutional neural network (CNN) is a popular kind of artificial neural network, which has recently become a research focus in the fields of speech analysis and image recognition. In the HVS, there exist several information processing stages, from the V1 area, where simple cells respond selectively to directional structures, to the V4 area, where complex curvatures are identified. In this layer-by-layer process, the receptive field is gradually enlarged, and the image characteristics to which the neurons respond become more and more complicated. Inspired by the HVS, the activities in an artificial neural network that processes an image are made similar to those in the HVS, and the CNN provides an excellent model of this mechanism. That is to say, the CNN extracts features layer by layer: the deeper the layer in the network, the more complex and higher dimensional the feature maps are.

      History

      The CNN dates back to the papers by Hubel and Wiesel in the late 1960s (Hubel and Wiesel 1968), in which they reported that the visual cortex of cats and monkeys contains neurons that react individually to directional structures. A single neuron is affected only by visual stimuli within a limited region of the visual field, known as its receptive field. Adjacent cells have similar and overlapping receptive fields, the size and position of which vary, together forming a complete visual spatial map. This justifies the use of local receptive fields in CNNs.

      In 1980, the neocognitron was proposed, which introduced the concept of the receptive field into the artificial neural network and marked the birth of the CNN (Fukushima 1980).

      In 1988, the shift-invariant neural network was proposed to improve the performance of the CNN, allowing objects to be recognized in the presence of displacements or slight deformations (Waibel et al 1989). The feed-forward architecture of the CNN was then extended in the neural abstraction pyramid by lateral and feedback connections. The resultant recurrent convolutional network allows contextual information to be incorporated to resolve local ambiguities iteratively. In contrast to the previous models, image-like outputs at high resolution were generated.

      Finally, in 2005, a GPU implementation of the CNN was reported, making the CNN much more efficient to compute (Steinkraus et al 2005). As a result, the CNN entered its prime.

      Architecture

      A CNN consists of input and output layers and a number of hidden layers. The hidden layers can be categorized as convolutional, pooling, activation, and fully connected layers. The input is generally a vector, matrix, or tensor. A convolutional layer convolves its input with filters to extract features at a higher level, while a pooling layer down-samples its input to reduce the amount of data while maintaining critical information. An activation layer introduces nonlinearity. A fully connected layer integrates the features obtained by convolution and pooling. Finally, the output layer produces the final output.
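      The chain of layer types described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a reference implementation: the function names, the 6 × 6 test image, the hand-set edge filter, and the fully connected weights are all hypothetical, ReLU is assumed as the activation, max pooling is assumed as the pooling rule, and the convolution is computed as a sliding correlation without kernel flipping, without padding, and with stride 1.

```python
def correlate2d_valid(image, kernel, bias=0.0):
    """Convolutional layer: slide the kernel over the image (no padding, stride 1)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[bias + sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k_h) for j in range(k_w))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Activation layer: element-wise nonlinearity (ReLU assumed here)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Pooling layer: keep the maximum of each non-overlapping 2 x 2 window."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


def fully_connected(fmap, weights, biases):
    """Fully connected layer: every output sees every pooled feature."""
    flat = [v for row in fmap for v in row]
    return [sum(w * v for w, v in zip(ws, flat)) + b
            for ws, b in zip(weights, biases)]


# Hypothetical 6 x 6 input with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-set 3 x 3 filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1] for _ in range(3)]

features = relu(correlate2d_valid(image, kernel))   # 4 x 4 feature map
pooled = max_pool2x2(features)                      # 2 x 2 after down-sampling
scores = fully_connected(pooled,
                         weights=[[0.25] * 4, [-0.25] * 4],
                         biases=[0.0, 0.0])

print(pooled)  # [[3.0, 3.0], [3.0, 3.0]]
print(scores)  # [3.0, -3.0]
```

      The data shrink from 36 input pixels to 16 feature values and then to 4 pooled values, while the edge response survives each stage. In a trained network the filter and the fully connected weights would be learned rather than set by hand.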

      Distinguishing features

      A CNN has three distinguishing features: local connectivity, shared weights, and multiple feature maps. According to the concept of receptive fields, a CNN exploits spatial locality by enforcing local connectivity between neurons of adjacent layers. This architecture ensures that the learned ‘filters’ produce the strongest responses to spatially local input patterns of relevance. Stacking many such layers together forms nonlinear filters that become increasingly global as the depth increases (i.e. responsive to an increasingly larger region), so that the network first creates representations for local and primitive features of the input, and then assembles them into semantic and global features.

      In a CNN, each filter is replicated across the entire visual field. These replicated units share the same parameters; that is, the same weight vector and bias are used repeatedly to produce a feature map. In a given convolutional layer, the features of interest for all neurons can be analyzed by a shift-invariant correlation. Replicating units in this way allows the same feature to be detected regardless of its position in the visual field.
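      The effect of replicating one filter across the field can be checked with a small one-dimensional sketch. The signal, the two-tap filter, and the function name are hypothetical and chosen only for illustration. Because the same weights are applied at every position, shifting the input pattern by one sample shifts the response by one sample without changing its values.

```python
def correlate1d(signal, kernel):
    """Apply the same kernel (shared weights) at every valid position."""
    k = len(kernel)
    return [sum(signal[p + i] * kernel[i] for i in range(k))
            for p in range(len(signal) - k + 1)]


kernel = [-1, 1]                    # responds to a rising step
signal = [0, 0, 1, 1, 0, 0, 0]      # a short pulse
shifted = [0, 0, 0, 1, 1, 0, 0]     # the same pulse moved one sample to the right

resp = correlate1d(signal, kernel)
resp_shifted = correlate1d(shifted, kernel)

print(resp)          # [0, 1, 0, -1, 0, 0]
print(resp_shifted)  # [0, 0, 1, 0, -1, 0]
```

      The rising edge is detected with the same strength in both cases, only at a different position in the feature map.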

      Each CNN layer has neurons arranged in three dimensions: width, height, and depth. The width and height represent the size of a feature map. The depth represents the number of feature maps over the same receptive field, each of which responds to a different type of structural feature within the same visual scope. Finally, different types of layers, both locally and fully connected, are stacked to form the CNN architecture.

      Together, these properties allow CNNs to achieve impressive trainability and generalizability on visual information processing problems. Local connectivity reflects the fact that correlations in images are often local. Weight sharing reflects the prior knowledge of shift invariance, dramatically reducing the number of free parameters, lowering the memory requirement for training the network, and enabling larger and deeper networks.
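      The reduction in free parameters can be made concrete by counting. The sketch below assumes illustrative sizes, a 32 × 32 single-channel input mapped to six feature maps by 5 × 5 filters without padding and with stride 1, and one bias per unit or per filter. The function name and the dictionary keys are hypothetical.

```python
def count_parameters(in_h, in_w, kernel, n_maps):
    """Count weights plus biases for three ways of wiring one layer."""
    out_h = in_h - kernel + 1          # output height without padding, stride 1
    out_w = in_w - kernel + 1
    n_units = n_maps * out_h * out_w   # neurons in the output layer
    return {
        # every output neuron connected to every input pixel
        "fully_connected": n_units * (in_h * in_w + 1),
        # local receptive fields, but separate weights for each neuron
        "local_unshared": n_units * (kernel * kernel + 1),
        # local receptive fields with one shared filter per feature map
        "local_shared": n_maps * (kernel * kernel + 1),
    }


counts = count_parameters(32, 32, 5, 6)
for name, value in counts.items():
    print(name, value)
# fully_connected 4821600
# local_unshared 122304
# local_shared 156
```

      Under these assumptions, local connectivity alone cuts the count by a factor of about 39, and weight sharing reduces it further to 156 parameters.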

      A CNN example: LeNet-5

      Let us use the famous LeNet-5 network as an example to showcase the convolutional neural network. Yann LeCun et al proposed the LeNet-5 model in 1998, as shown in figure 3.16 (Lecun et al 1998). This network is one of the earliest convolutional neural networks and produced a classic result in the field. It is deep and very successful for handwritten character recognition. It was widely used by banks in the United States to identify handwritten digits on checks.


      Figure 3.16. LeNet-5 network for digit recognition. Adapted from Lecun et al (1998). Reproduced with permission. Copyright IEEE 1998.

      LeNet-5 has seven layers in total, each of which contains trainable parameters. Each layer produces multiple feature maps, and each feature map extracts features through convolution. The input data are a handwritten digit dataset, which is divided into a training set of 60 000 images in ten classes and a testing set of 10 000 images in the same ten classes. In the final layer, the network outputs the probabilities of the ten classes, computed with the softmax function, from which the class of a digit image is predicted. More specifics on LeNet-5 are as follows.

      1 Input layer: The input image is uniformly normalized to be 32 × 32 in size.

      2 C1 layer: The first convolution layer operates upon the input image with six convolution filters 5 × 5 in size, producing six feature maps 28 × 28 in size.

      3 S2 layer: Pooling with six 2 × 2 filters for down-sampling. The pooling layer sums the pixel values in each 2 × 2 moving window over the C1 layer. The S2 layer produces six feature maps