Professor Ge Wang

Machine Learning for Tomographic Imaging


stronger. Such a network can extract and express more complex and higher dimensional features.

      Transposed convolution

      The transposed convolution works in the opposite direction of a normal convolution, i.e. it maps something shaped like the output of a convolution to something shaped like its input. It uses the same connection pattern as the normal convolution, but with the connections traversed in reverse. With the transposed convolution, one can enlarge an input, which is how feature maps are up-sampled from a low to a high resolution.

      To explain the transposed convolution, consider the example shown in figure 3.13. A convolution can be expressed as a matrix multiplication. If an input X and an output Y are unrolled into column vectors, and the convolution kernel is represented as a sparse matrix C (sparse because the kernel normally acts locally), then the convolution operation can be written as

      \[ Y = CX, \tag{3.14} \]

      where the matrix C can be written as

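      \[
      C = \begin{pmatrix}
      w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 & 0 \\
      0 & w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 \\
      0 & 0 & 0 & w_1 & w_2 & 0 & w_3 & w_4 & 0 \\
      0 & 0 & 0 & 0 & w_1 & w_2 & 0 & w_3 & w_4
      \end{pmatrix}. \tag{3.15}
      \]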

      Using this representation, the transposed matrix C⊤ is easily obtained for transposed convolution. Then, we have the output X′ of the transposed convolution expressed as

      \[ X' = C^{\top} Y. \tag{3.16} \]

      Figure 3.13. Convolution kernel (left), normal convolution (middle), and transposed convolution (right). The input is in blue and the output is in green.

      It is worth mentioning that the output X′ of the transposed convolution need not equal the input X; the two only share the same connectivity. In addition, the weight values in the transposed convolution are not necessarily copied from the corresponding convolution. When training a convolutional neural network, the weight parameters of the transposed convolution can be updated iteratively like any other weights.
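
      As a concrete check of equations (3.15) and (3.16), the following minimal sketch builds C in plain Python, assuming a 2×2 kernel applied with stride 1 and no padding to a 3×3 input (which is what the 4×9 shape of C implies). The weight values, the test input, and the helper names conv_matrix, matvec, and transpose are illustrative assumptions, not taken from the text.

```python
def conv_matrix(w1, w2, w3, w4):
    """Sparse 4x9 matrix C for a 2x2 kernel sliding over a 3x3 input."""
    return [
        [w1, w2, 0, w3, w4, 0, 0, 0, 0],
        [0, w1, w2, 0, w3, w4, 0, 0, 0],
        [0, 0, 0, w1, w2, 0, w3, w4, 0],
        [0, 0, 0, 0, w1, w2, 0, w3, w4],
    ]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


C = conv_matrix(1, 2, 3, 4)          # hypothetical kernel weights
X = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # 3x3 input unrolled row by row
Y = matvec(C, X)                     # normal convolution: 9 values -> 4 values
X_prime = matvec(transpose(C), Y)    # transposed convolution: 4 values -> 9 values

print(Y)
print(X_prime)
```

      Here X′ has nine entries like X but different values, which illustrates that the transposed convolution restores the shape and connectivity of the input, not the input itself.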

      Essentially, a pooling operation aggregates features, reducing the dimensionality of the feature space. In neurological terms, a neuron aggregates and processes bioelectrical signals of various rates arriving from other neurons, which is analogous to pooling. Max pooling corresponds to passing on the signal with the highest rate, while mean pooling passes on the average of the signals involved. Similarly, pooling in artificial neural networks serves to compress features, accelerate computation, provide a degree of translational invariance, and reduce the risk of overfitting. Pooling can take many forms, such as max pooling, mean pooling, and stochastic pooling.

      1 Max pooling: Select the maximum value within an image window as the value of the corresponding pixel/voxel.

      2 Mean pooling: Calculate the average value of an image window as the value of the corresponding pixel/voxel.

      3 Stochastic pooling: Stochastic pooling first computes a probability for each element in a pooling region (Zeiler and Fergus 2013). In the simplest case, the probability of each pixel is its value divided by the sum of the values in the pooling window. One value is then drawn at random from each pooling region according to this probability distribution. Larger values are therefore more likely to be selected, but the largest value is not guaranteed to be selected.

      Generally speaking, mean pooling tends to retain the overall characteristics of the data and to emphasize background information, max pooling tends to bring out textural information, and stochastic pooling keeps the advantages of max pooling while partially avoiding the excessive distortion that max pooling can cause. Figure 3.14 illustrates these three pooling strategies.

      Figure 3.14. Illustration of the three types of pooling strategies.
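
      The three pooling strategies can be sketched in a few lines of plain Python. This is a minimal illustration that assumes non-overlapping k×k windows and non-negative feature values (as after a ReLU), so that the values can serve directly as sampling weights. The function names and the sample feature map are hypothetical.

```python
import random


def pool(img, k, reduce_fn):
    """Apply reduce_fn to each non-overlapping k x k window of a 2D list."""
    out = []
    for r in range(0, len(img) - k + 1, k):
        out_row = []
        for c in range(0, len(img[0]) - k + 1, k):
            window = [img[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(reduce_fn(window))
        out.append(out_row)
    return out


def max_pool(img, k=2):
    return pool(img, k, max)


def mean_pool(img, k=2):
    return pool(img, k, lambda w: sum(w) / len(w))


def stochastic_pool(img, k=2, rng=None):
    """Draw one value per window with probability value / sum(window)."""
    rng = rng or random.Random()

    def sample(window):
        if sum(window) == 0:   # all-zero window: no distribution to sample from
            return 0
        return rng.choices(window, weights=window, k=1)[0]

    return pool(img, k, sample)


img = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [0, 0, 7, 8],
]
pooled_max = max_pool(img)
pooled_mean = mean_pool(img)
pooled_stochastic = stochastic_pool(img, rng=random.Random(0))
```

      Max and mean pooling are deterministic, whereas stochastic pooling returns a different element of each window from run to run unless the random generator is seeded.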

      The loss function is critical in guiding the training of an artificial neural network. The loss measures the discrepancy between a predicted value $\hat{y}$ and the corresponding label y. It is a non-negative function whose minimization drives the network toward convergence during training. Training a neural network means updating the network parameters so that $\hat{y}$ approaches y as closely as possible under some chosen measure. The local slope, or more generally the gradient, of the loss at the current parameter setting tells us how to update the parameters to reduce the loss. That is, the loss function supplies the signal by which we refine the parameters. The loss function is defined in terms of labels as follows:

      \[ L(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\left(y^{(i)}, f\left(x^{(i)}, \theta\right)\right), \tag{3.17} \]

      where $x^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_m^{(i)}] \in \mathbb{R}^m$ denotes a training sample, $y^{(i)}$ denotes the corresponding label or gold standard, θ is the set of parameters to be learned, and f(·) is the model function. The loss function can take a variety of forms, as the definition of discrepancy is not unique. Next, we introduce several commonly used loss functions.
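
      To make the role of the gradient concrete, the following sketch minimizes a loss of the form (3.17) by plain gradient descent. It assumes a one-parameter model f(x, θ) = θx and a squared per-sample loss; the data, the learning rate lr, and the number of steps are arbitrary illustrative choices.

```python
def loss(theta, xs, ys):
    """Average per-sample squared loss for the model f(x, theta) = theta * x."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(theta, xs, ys):
    """Analytic derivative of loss() with respect to theta."""
    return sum(-2 * x * (y - theta * x) for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]      # generated with theta = 2
theta = 0.0               # initial guess
lr = 0.05                 # learning rate (step size)
history = []
for _ in range(200):
    history.append(loss(theta, xs, ys))
    theta -= lr * grad(theta, xs, ys)   # step against the gradient

print(theta, history[0], history[-1])
```

      Each update moves θ in the direction that lowers the loss, so the recorded loss shrinks from its initial value toward zero as θ approaches 2.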

      Mean squared error/L2

      The mean squared error (MSE) is widely used as the performance measure in linear regression. The method for minimizing the MSE is called ordinary least squares (OLS), which minimizes the sum of squared distances from the data points to the regression line. The MSE function takes all errors into consideration, with the same or different weights. The standard form of the MSE function is defined as

      \[ L = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2, \tag{3.18} \]

      where $y^{(i)} - \hat{y}^{(i)}$ is referred to as the residual, so that minimizing the MSE amounts to minimizing the sum of squared residuals. Note that, when L is treated as a statistical estimator, the normalizing factor 1/n is sometimes replaced by 1/(n − 1) to remove the bias of the estimator.
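
      A direct transcription of equation (3.18) into plain Python might look as follows; the function name mse and the sample values are illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    return sum(r * r for r in residuals) / len(residuals)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
value = mse(y_true, y_pred)   # residuals are 0.5, -0.5, 0.0, -1.0
print(value)
```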

      Mean absolute error/L1

      The mean absolute error (MAE) (Chai and Draxler 2014) is computed as

      \[ L = \frac{1}{n} \sum_{i=1}^{n} \left| y^{(i)} - \hat{y}^{(i)} \right|, \tag{3.19} \]

      where $|\cdot|$ denotes the absolute value. Although both MSE and MAE are used in predictive modeling, there are several differences between them. First, the gradient of MAE is less convenient to work with than that of MSE, because the absolute value is not differentiable at zero. Second, MSE puts more weight on large errors, since squaring magnifies them far more than small ones. In practice, some large errors may be outliers that should be ignored. MAE, in contrast, treats all errors linearly and is therefore more robust to outliers.
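
      The difference in outlier sensitivity can be seen in a small numerical experiment. The sketch below compares both losses and their gradients with respect to the predictions, once on clean predictions and once with a single outlier. All names and numbers are illustrative, and at a zero residual the MAE gradient is set to 0, which is one valid subgradient choice among several.

```python
def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse_grad(y_true, y_pred):
    """Gradient of MSE w.r.t. each prediction: proportional to the residual."""
    n = len(y_true)
    return [-2 * (y - p) / n for y, p in zip(y_true, y_pred)]


def mae_grad(y_true, y_pred):
    """Subgradient of MAE w.r.t. each prediction: only the sign of the residual."""
    n = len(y_true)
    signs = [(y > p) - (y < p) for y, p in zip(y_true, y_pred)]
    return [-s / n for s in signs]


y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 2.5, 3.5]      # every residual has magnitude 0.5
outlier = [1.5, 2.5, 2.5, 14.0]   # last residual is -10

print(mse(y_true, clean), mse(y_true, outlier))
print(mae(y_true, clean), mae(y_true, outlier))
print(mse_grad(y_true, outlier)[3], mae_grad(y_true, outlier)[3])
```

      With these numbers, the single outlier multiplies the MSE by roughly a hundred but the MAE by less than six, and its MAE gradient has the same magnitude as that of any other sample.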

      Similar to MAE, the L1 loss function is the sum of absolute differences between actual and predicted values. L1 does not have the normalizing factor n (or n – 1). That is, the L1 loss is defined as

      \[ L = \sum_{i=1}^{n} \left| y^{(i)} - \hat{y}^{(i)} \right|. \tag{3.20} \]

      It should be emphasized that the L1 loss is extremely important in the field of compressed sensing. Minimizing the L1 loss tends to produce a sparse solution, which is considered a major breakthrough in the signal processing field.

      Mean absolute percentage error

      The mean absolute percentage error (MAPE) (De Myttenaere et al 2016) is a variant of MAE, which is computed as

      \[ L = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y^{(i)} - \hat{y}^{(i)}}{y^{(i)}} \right| \cdot 100. \tag{3.21} \]

      MAPE measures the percentage error between predicted and real values. The idea behind MAPE is simple and appealing, but it is not commonly used in practice because of some major drawbacks. For instance, it cannot be used if any real value is zero, since that would mean a division by zero. Moreover, it has no upper bound.
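
      Both drawbacks are easy to reproduce. The following sketch implements equation (3.21) in plain Python; raising a ValueError for zero-valued targets is a design choice made here for illustration, and the sample numbers are arbitrary.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    total = sum(abs((y - p) / y) for y, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


typical = mape([100.0, 200.0, 50.0], [110.0, 180.0, 50.0])   # errors of 10%, 10%, 0%
unbounded = mape([1.0], [11.0])   # a small true value inflates the percentage
print(typical, unbounded)
```

      The second call returns a value far above 100, showing that the percentage grows without limit as the prediction moves away from a small true value.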