
Computational Statistics in Data Science

… $\boldsymbol{v}_m)$, $\boldsymbol{W} = (\boldsymbol{w}_0, \ldots, \boldsymbol{w}_m)$, $\boldsymbol{x}_i = (x_i^0, x_i^1, \ldots, x_i^p)$, and $\tau(\cdot)$ and $\gamma(\cdot)$ are nonlinear functions.

Figure 1 An MLP with three layers.

Most often, $\tau(\cdot)$ and $\gamma(\cdot)$ are chosen to be the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$. This function is often chosen for the following desirable properties: (i) it is highly nonlinear, (ii) it is monotonically increasing, (iii) it is asymptotically bounded at some finite value in both the negative and positive directions, and (iv) its output lies in the interval $(0, 1)$, so that it stays relatively close to 0. However, Yann LeCun recommends that a different function be used: $1.7159 \tanh\left(\frac{2}{3}x\right)$. This function retains properties (i) to (iii) of the logistic function, with outputs bounded in $(-1.7159, 1.7159)$, and has the additional advantage of being symmetric about the origin, which results in outputs closer to 0 on average than those of the logistic function.
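To make the comparison concrete, the following minimal Python sketch evaluates both activation functions at a few points. The function names `logistic` and `scaled_tanh` are illustrative choices and do not come from the text.

```python
import math


def logistic(z):
    """Logistic function 1 / (1 + exp(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def scaled_tanh(z):
    """Scaled hyperbolic tangent 1.7159 * tanh(2z/3); symmetric about the origin."""
    return 1.7159 * math.tanh(2.0 * z / 3.0)


# Evaluate both functions at a negative, zero, and positive input.
for z in (-2.0, 0.0, 2.0):
    print(z, logistic(z), scaled_tanh(z))
```

At an input of 0 the logistic function returns 0.5, whereas the scaled hyperbolic tangent returns 0, and the latter is an odd function, so `scaled_tanh(-z)` equals `-scaled_tanh(z)`.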

3.3 Training an MLP

We want to choose the weights and biases so that they minimize the total loss over a given dataset. As in the general supervised learning approach, we want to find an optimal prediction $\delta^*(\boldsymbol{X}, \boldsymbol{W}, \boldsymbol{V})$ such that

(1)

where $\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n)$, and $\ell(\cdot, \cdot)$ is the cross-entropy loss.

(2) $\ell(\hat{y}_i, y_i) = -\sum_{c=1}^{m} y_{i,c} \log \hat{y}_{i,c}$

where $m$ is the total number of classes; $y_{i,c} = 1$ if the $i$th sample belongs to class $c$, and 0 otherwise; and $\hat{y}_{i,c}$ is the predicted score of the $i$th sample belonging to class $c$.
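As a small illustration of the cross-entropy loss for a single sample, the sketch below computes it for a one-hot label and two made-up score vectors. The function name `cross_entropy` and the example numbers are assumptions chosen for illustration only.

```python
import math


def cross_entropy(y_hat, y):
    """Cross-entropy between predicted scores y_hat and a one-hot label y."""
    # Classes with y_c = 0 contribute nothing to the sum, so they are skipped.
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)


y = [0, 1, 0]                 # the sample belongs to the second of three classes
confident = [0.1, 0.8, 0.1]   # hypothetical scores favoring the true class
unsure = [0.4, 0.3, 0.3]      # hypothetical scores spread across classes

print(cross_entropy(confident, y))
print(cross_entropy(unsure, y))
```

Because the label is one-hot, only the score assigned to the true class enters the loss, so the confident prediction incurs a smaller loss than the unsure one.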

Function (1) cannot be minimized analytically by setting its derivatives to zero, so we must use an iterative method such as gradient descent. The application of gradient descent to MLPs leads to an algorithm known as backpropagation. Most often, we use stochastic gradient descent, which updates the weights using one sample (or a small batch) at a time and is therefore far faster per update. Note that backpropagation can be used to train different types of neural networks, not just MLPs.
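The following is a minimal sketch of backpropagation with stochastic gradient descent for a three-layer MLP, written in plain Python. It is not the authors' implementation: the logistic hidden activation, the softmax output, the bias handled as a leading 1 in each input, the toy dataset `DATA`, and all hyperparameters (`n_hidden`, `epochs`, `lr`, `seed`) are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    top = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, V, W):
    """x starts with a constant 1 (bias). V: hidden weights, W: output weights."""
    h = [1.0] + [sigmoid(sum(vj * xj for vj, xj in zip(v, x))) for v in V]
    y_hat = softmax([sum(wj * hj for wj, hj in zip(w, h)) for w in W])
    return h, y_hat


def mean_loss(data, V, W):
    """Average cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        _, y_hat = forward(x, V, W)
        total -= sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)
    return total / len(data)


def sgd_step(x, y, V, W, lr):
    """One backpropagation update using a single sample."""
    h, y_hat = forward(x, V, W)
    # For softmax with cross-entropy, the output-layer error is y_hat - y.
    d_out = [p - t for p, t in zip(y_hat, y)]
    # Propagate the error back to each hidden node (h[0] is the bias unit).
    d_hid = []
    for k in range(len(V)):
        upstream = sum(d_out[c] * W[c][k + 1] for c in range(len(W)))
        d_hid.append(upstream * h[k + 1] * (1.0 - h[k + 1]))
    # Gradient descent updates, applied after all errors are computed.
    for c in range(len(W)):
        for j in range(len(h)):
            W[c][j] -= lr * d_out[c] * h[j]
    for k in range(len(V)):
        for j in range(len(x)):
            V[k][j] -= lr * d_hid[k] * x[j]


def train(data, n_hidden=3, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    before = mean_loss(data, V, W)
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)  # visit samples in random order each epoch
        for x, y in order:
            sgd_step(x, y, V, W, lr)
    return V, W, before, mean_loss(data, V, W)


# Hypothetical toy data: (input with leading 1 for the bias, one-hot label).
DATA = [
    ([1.0, 0.0, 0.0], [1, 0]),
    ([1.0, 0.0, 1.0], [1, 0]),
    ([1.0, 1.0, 0.0], [0, 1]),
    ([1.0, 1.0, 1.0], [0, 1]),
]

V, W, before, after = train(DATA)
print("mean loss before:", before, "after:", after)
```

Each call to `sgd_step` uses only one sample, which is what makes the per-update cost independent of the dataset size.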

We would like to address the issue of possibly being trapped in local minima, as backpropagation is a direct application of gradient descent to neural networks, and gradient descent is prone to finding local minima, especially in high‐dimensional spaces. It has been observed in practice that backpropagation actually does not typically get stuck in local minima and generally reaches the global minimum. There do, however, exist pathological data examples in which backpropagation will not converge to the global minimum, so convergence to the global minimum is certainly not an absolute guarantee. It remains a theoretical mystery why backpropagation does in fact generally converge to the global minimum, and under what conditions it will do so. However, some theoretical results have been developed to address this question. In particular, Gori and Tesi [7] established that for linearly separable data, backpropagation will always converge to the global solution.

So far, we have discussed a simple MLP with three layers aimed at classification problems. However, there are many extensions to the simple case. In general, an MLP can have any number of hidden layers. The more hidden layers there are, the more complex the model, and therefore the more difficult it is to train/optimize the weights. The model remains almost exactly the same, except for the insertion of multiple hidden layers between the first hidden layer and the output layer. Values for each node in a given layer are determined in the same way as before, that is, as a nonlinear transformation of the values of the nodes in the previous layer and the associated weights. Training the network via backpropagation is almost exactly the same.
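The layer-by-layer computation described above can be sketched as a loop over weight matrices, so that adding a hidden layer only means adding one more matrix to a list. This is an illustrative sketch under assumptions: the names `dense` and `forward_deep`, the logistic activation in every layer, and the convention that each weight row stores a bias followed by one weight per input are not taken from the text.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, inputs, activation):
    """Compute one layer: each row holds a bias followed by one weight per input."""
    return [
        activation(row[0] + sum(w * a for w, a in zip(row[1:], inputs)))
        for row in weights
    ]


def forward_deep(layers, x):
    """Pass x through every layer in turn; len(layers) - 1 of them are hidden."""
    values = x
    for weights in layers:
        values = dense(weights, values, logistic)
    return values


# Hypothetical network: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
layers = [
    [[0.0, 0.5, -0.5], [0.1, 0.2, 0.3], [-0.1, 0.4, 0.0]],
    [[0.0, 1.0, -1.0, 0.5], [0.2, -0.3, 0.3, 0.1]],
    [[0.0, 0.7, -0.7]],
]
print(forward_deep(layers, [1.0, 2.0]))
```

The values of each layer depend only on the previous layer's values and the associated weights, which is why the same loop body serves for any depth.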


4 Convolutional Neural Networks

4.1 Introduction

A CNN is a modified DNN that is particularly well suited to handling image data. A CNN usually contains not only fully connected layers but also convolutional layers and pooling layers, which are what distinguish it from a standard DNN. An image is a matrix of pixel values, which must be flattened into a vector before being fed into a DNN, because a DNN takes a vector as input. However, spatial information might be lost in this process. A convolutional layer can take a matrix or tensor as input and is able to capture the spatial and temporal dependencies in an image.
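The contrast between flattening an image and operating on it as a matrix can be sketched as follows. The helper names `flatten` and `slide_kernel`, the 3×3 example image, and the 2×2 weight matrix are hypothetical and chosen only for illustration; the sliding operation uses no padding and a stride of 1, and, like most deep learning libraries, it does not flip the weight matrix.

```python
def flatten(image):
    """Turn a matrix of pixel values into a single vector, row by row."""
    return [pixel for row in image for pixel in row]


def slide_kernel(image, kernel):
    """Slide a small weight matrix over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Sum of elementwise products between the kernel and one image patch.
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(flatten(image))
print(slide_kernel(image, kernel))
```

In the flattened vector, pixels 1 and 4 are vertical neighbors in the image but end up three positions apart, whereas each output of `slide_kernel` is computed from a contiguous 2×2 patch and so keeps neighboring pixels together.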

In the convolutional layer, the weight matrix (kernel) scans over the input image to produce a feature matrix. This process is called the convolution operation. The pooling layer operates similarly to the convolutional layer and
