of the greatest decrease, which is given by the negative of the gradient. The idea is that if we repeatedly do this, we will eventually arrive at a minimum. The algorithm is guaranteed to converge only to a local minimum, not necessarily a global one [4]; see Algorithm 1.
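To make the update concrete, here is a minimal sketch of the iteration in Algorithm 1; the objective and its gradient (`grad_f`) are hypothetical stand-ins for whatever error criterion is being minimized:

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, n_steps=1000):
    """Repeatedly step in the direction of the negative gradient.

    grad_f  -- function returning the gradient of the objective at x
    x0      -- starting point
    lr      -- step size (learning rate)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad_f(x)  # move against the gradient
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))  # near the origin
```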
Gradient descent is often very slow in machine learning applications, as computing the true gradient of the error criterion usually requires a pass through the entire dataset. Since the gradient must be recomputed at every step of the algorithm, the full dataset ends up being traversed a very large number of times. To speed up the process, we instead use a variation known as stochastic gradient descent, which approximates the gradient at each step by the gradient at a single observation, significantly reducing the cost per step [5]; see Algorithm 2.
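As a concrete illustration of Algorithm 2, the following is a minimal sketch of stochastic gradient descent applied to least-squares linear regression; the choice of model and loss here is an illustrative assumption, not one made by the text:

```python
import numpy as np

def sgd(X, y, lr=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent on squared error for a linear model.

    Each update uses the gradient at a single observation in place of
    the gradient over the full dataset.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            residual = X[i] @ w - y[i]  # error at one observation
            w -= lr * residual * X[i]   # single-sample gradient step
    return w

# Toy data: y = 2*x1 - 3*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=200)
print(sgd(X, y))  # close to [2, -3]
```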
3 Feedforward Neural Networks
3.1 Introduction
A feedforward neural network, also known as a multilayer perceptron (MLP), is a popular supervised learning method that provides a parameterized form for the nonlinear map $f \colon \mathbb{R}^{d} \to \mathbb{R}^{K}$ from a $d$-dimensional input vector to a $K$-dimensional vector of predicted class scores.
3.2 Model Description
We start by describing a simple MLP with three layers, as depicted in Figure 1.
The bottom layer of a three-layer MLP is called the input layer, with each node representing the corresponding element of the input vector. The top layer is known as the output layer and represents the final output of the model, a predicted vector; each node in the output layer represents the predicted score of a different class. The middle layer is called the hidden layer and captures the unobserved latent features of the input. This is the only layer whose number of nodes is chosen by the user of the model, rather than determined by the problem itself.
The directed edges in the network represent weights from a node in one layer to a node in the next layer. We denote the weight from node $i$ in one layer to node $j$ in the next layer by $w_{ij}$.
The value of each node in the hidden and output layers is determined as a nonlinear transformation of a linear combination of the values of the nodes in the previous layer, weighted by the edges from each of those nodes to the node of interest. That is, the value of node $j$ is $v_j = \sigma\!\left(\sum_{i} w_{ij}\, v_i\right)$, where the sum runs over the nodes $i$ of the previous layer and $\sigma$ is a nonlinear activation function.
More formally, the map computed by the three-layer MLP can be written as
$$\hat{\mathbf{y}} = f(\mathbf{x}) = \sigma\!\left(W^{(2)}\,\sigma\!\left(W^{(1)}\mathbf{x}\right)\right),$$
where $\mathbf{x} \in \mathbb{R}^{d}$ is the input vector, $W^{(1)}$ and $W^{(2)}$ are the matrices of weights into the hidden and output layers, respectively, and $\sigma$ is applied elementwise.
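A minimal sketch of this forward pass, assuming a sigmoid activation for $\sigma$ (the particular activation is an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, W2):
    """Forward pass of a three-layer MLP.

    x  -- input vector, shape (d,)
    W1 -- weights into the hidden layer, shape (h, d)
    W2 -- weights into the output layer, shape (K, h)
    """
    h = sigmoid(W1 @ x)      # hidden-layer values: sigma(W1 x)
    y_hat = sigmoid(W2 @ h)  # output scores: sigma(W2 h)
    return y_hat

# Example with d=4 inputs, h=3 hidden nodes, K=2 output classes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(2, 3))
print(mlp_forward(x, W1, W2))
```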