of the greatest decrease, which is given by the negative of the gradient. The idea is that if we repeatedly do this, we will eventually arrive at a minimum. The algorithm is guaranteed to converge only to a local minimum, not necessarily a global one [4]; see Algorithm 1.
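To make the update concrete, here is a minimal sketch of the iteration in Algorithm 1; the objective and its gradient (`grad_f`) are hypothetical stand-ins for whatever error criterion is being minimized:

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, n_steps=1000):
    """Repeatedly step in the direction of the negative gradient.

    grad_f  -- function returning the gradient of the objective at x
    x0      -- starting point
    lr      -- step size (learning rate)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad_f(x)  # move against the gradient
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))  # near the origin
```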
Gradient descent is often very slow in machine learning applications, as computing the true gradient of the error criterion usually requires a pass through the entire dataset. Since the gradient must be recomputed at every step of the algorithm, the full dataset ends up being traversed a very large number of times. To speed up the process, we instead use a variation known as stochastic gradient descent, which approximates the gradient at each step by the gradient at a single observation, significantly reducing the cost per step [5]; see Algorithm 2.
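As a concrete illustration of Algorithm 2, the following is a minimal sketch of stochastic gradient descent applied to least-squares linear regression; the choice of model and loss here is an illustrative assumption, not one made by the text:

```python
import numpy as np

def sgd(X, y, lr=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent on squared error for a linear model.

    Each update uses the gradient at a single observation in place of
    the gradient over the full dataset.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            residual = X[i] @ w - y[i]  # error at one observation
            w -= lr * residual * X[i]   # single-sample gradient step
    return w

# Toy data: y = 2*x1 - 3*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=200)
print(sgd(X, y))  # close to [2, -3]
```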
3 Feedforward Neural Networks
3.1 Introduction
A feedforward neural network, also known as a multilayer perceptron (MLP), is a popular supervised learning method that provides a parameterized form for the nonlinear map $f \colon \mathbb{R}^{d} \to \mathbb{R}^{K}$ from a $d$-dimensional input vector to a $K$-dimensional vector of predicted class scores.
3.2 Model Description
We start by describing a simple MLP with three layers, as depicted in Figure 1.
The bottom layer of a three-layer MLP is called the input layer, with each node representing the corresponding element of the input vector. The top layer is known as the output layer and represents the final output of the model, a predicted vector; each node in the output layer represents the predicted score of a different class. The middle layer is called the hidden layer and captures the unobserved latent features of the input. This is the only layer whose number of nodes is chosen by the user of the model, rather than determined by the problem itself.
The directed edges in the network represent weights from a node in one layer to a node in the next layer. We denote the weight from node $i$ in one layer to node $j$ in the next layer by $w_{ij}$.
The value of each node in the hidden and output layers is determined as a nonlinear transformation of a linear combination of the values of the nodes in the previous layer, weighted by the edges from each of those nodes to the node of interest. That is, the value of node $j$ is $v_j = \sigma\!\left(\sum_{i} w_{ij}\, v_i\right)$, where the sum runs over the nodes $i$ of the previous layer and $\sigma$ is a nonlinear activation function.
More formally, the map computed by the three-layer MLP can be written as
$$\hat{\mathbf{y}} = f(\mathbf{x}) = \sigma\!\left(W^{(2)}\,\sigma\!\left(W^{(1)}\mathbf{x}\right)\right),$$
where $\mathbf{x} \in \mathbb{R}^{d}$ is the input vector, $W^{(1)}$ and $W^{(2)}$ are the matrices of weights into the hidden and output layers, respectively, and $\sigma$ is applied elementwise.
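A minimal sketch of this forward pass, assuming a sigmoid activation for $\sigma$ (the particular activation is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, W2):
    """Forward pass of a three-layer MLP.

    x  -- input vector, shape (d,)
    W1 -- weights into the hidden layer, shape (h, d)
    W2 -- weights into the output layer, shape (K, h)
    """
    h = sigmoid(W1 @ x)      # hidden-layer values: sigma(W1 x)
    y_hat = sigmoid(W2 @ h)  # output scores: sigma(W2 h)
    return y_hat

# Example with d=4 inputs, h=3 hidden nodes, K=2 output classes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(2, 3))
print(mlp_forward(x, W1, W2))
```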