the same structure and weights. A simple example of the process can be written as

      $$\boldsymbol{h}^{(t)} = \sigma\left(\boldsymbol{W}_1 \boldsymbol{x}^{(t)} + \boldsymbol{W}_2 \boldsymbol{h}^{(t-1)} + \boldsymbol{b}\right) \tag{9}$$

      where $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$ are weight matrices of the network $N$, $\sigma(\cdot)$ is an activation function, and $\boldsymbol{b}$ is the bias vector. Depending on the task, the loss function is evaluated, and the gradient is backpropagated through the network to update its weights. For a classification task, the final output $\boldsymbol{h}^{(T)}$ can be passed into another network to make a prediction. For a sequence-to-sequence model, $\hat{\boldsymbol{y}}^{(t)}$ can be generated based on $\boldsymbol{h}^{(t)}$ and then compared with $\boldsymbol{y}^{(t)}$.
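      To make the recurrence in Equation (9) concrete, the following is a minimal NumPy sketch of the forward pass, assuming a tanh activation and a zero initial hidden state; the function name rnn_forward and the toy dimensions are illustrative and not taken from the text.

```python
import numpy as np

def rnn_forward(x_seq, W1, W2, b, sigma=np.tanh):
    """Run the recurrence h^(t) = sigma(W1 x^(t) + W2 h^(t-1) + b) over a sequence.

    x_seq : array of shape (T, d_in), one input vector per time step
    W1    : (d_h, d_in) input-to-hidden weight matrix
    W2    : (d_h, d_h)  hidden-to-hidden weight matrix
    b     : (d_h,)      bias vector
    Returns the list of hidden states h^(1), ..., h^(T).
    """
    h = np.zeros(W2.shape[0])          # h^(0): zero initial hidden state (an assumption)
    states = []
    for x_t in x_seq:                  # the same weights are reused at every time step
        h = sigma(W1 @ x_t + W2 @ h + b)
        states.append(h)
    return states

# Toy usage: T = 5 time steps, 3-dimensional inputs, 4 hidden units
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
h_seq = rnn_forward(rng.normal(size=(T, d_in)),
                    0.1 * rng.normal(size=(d_h, d_in)),
                    0.1 * rng.normal(size=(d_h, d_h)),
                    np.zeros(d_h))
print(len(h_seq), h_seq[-1].shape)     # -> 5 (4,)
```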

      However, a drawback of RNNs is that they have difficulty "remembering" remote information. In an RNN, long-term memory is reflected in the weights of the network, which memorize remote information via shared weights. Short-term memory is in the form of information flow, where the output from the previous state is passed into the current state. However, when the sequence length $T$ is large, the optimization of an RNN suffers from the vanishing gradient problem. For example, if the loss $\ell^{(T)}$ is evaluated at $t = T$, the gradient with respect to $\boldsymbol{W}_1$ calculated via backpropagation can be written as

      $$\frac{\partial \ell^{(T)}}{\partial \boldsymbol{W}_1} = \sum_{t=0}^{T} \frac{\partial \ell^{(T)}}{\partial \boldsymbol{h}^{(T)}} \left( \prod_{j=t+1}^{T} \frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}} \right) \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{W}_1} \tag{10}$$

      where the product $\prod_{j=t+1}^{T} \frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}}$ is the reason for the vanishing gradient. In RNNs, the tanh function is commonly used as the activation function, so

      $$\boldsymbol{h}^{(j)} = \tanh\left(\boldsymbol{W}_1 \boldsymbol{x}^{(j)} + \boldsymbol{W}_2 \boldsymbol{h}^{(j-1)} + \boldsymbol{b}\right) \tag{11}$$
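      To see numerically why the product in Equation (10) drives the gradient toward zero, note that Equation (11) and the chain rule give $\frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}} = \mathrm{diag}\left(1 - (\boldsymbol{h}^{(j)})^2\right) \boldsymbol{W}_2$, whose entries are damped by the tanh derivative. The sketch below, for an arbitrary scalar-input RNN with randomly drawn weights, tracks the norm of the accumulated Jacobian product; the particular dimensions and weight scales are illustrative assumptions.

```python
import numpy as np

# Toy RNN with scalar inputs; weight scales are arbitrary illustrative choices
rng = np.random.default_rng(1)
d_h, T = 4, 50
W1 = 0.5 * rng.normal(size=(d_h, 1))
W2 = 0.3 * rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

h = np.zeros(d_h)
J = np.eye(d_h)                        # running product of Jacobians d h^(j) / d h^(j-1)
norms = []
for t in range(T):
    x_t = rng.normal(size=1)
    h = np.tanh(W1 @ x_t + W2 @ h + b)
    # For the tanh recurrence in Eq. (11): d h^(j) / d h^(j-1) = diag(1 - h^(j)**2) @ W2
    J = np.diag(1.0 - h**2) @ W2 @ J
    norms.append(np.linalg.norm(J))

# The norm of the accumulated product typically decays toward zero as T grows
print(norms[0], norms[T // 2], norms[-1])
```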

      6.3 Long Short‐Term Memory Networks

      Since the original LSTM model was introduced, many variants have been proposed. The forget gate was introduced by Gers et al. [20]; it has proven effective and is standard in most LSTM architectures. The forwarding process of an LSTM with a forget gate can be divided into two steps. In the first step, the following values are calculated:

      $$\begin{aligned}
      \boldsymbol{z}^{(t)} &= \tanh\left(\boldsymbol{W}_{1z} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2z} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_z\right) \\
      \boldsymbol{i}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1i} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2i} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_i\right) \\
      \boldsymbol{f}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1f} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2f} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_f\right) \\
      \boldsymbol{o}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1o} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2o} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_o\right)
      \end{aligned} \tag{12}$$
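      As a rough sketch of this first step, the helper below computes the candidate update $\boldsymbol{z}^{(t)}$ and the gates $\boldsymbol{i}^{(t)}$, $\boldsymbol{f}^{(t)}$, $\boldsymbol{o}^{(t)}$ of Equation (12), assuming $\sigma_g$ denotes the logistic sigmoid; the function name lstm_gates and the parameter layout are illustrative, and the second step of the forward pass (the cell-state and hidden-state updates) is not shown.

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid, assumed here to be the gate activation sigma_g
    return 1.0 / (1.0 + np.exp(-a))

def lstm_gates(x_t, h_prev, params):
    """First step of the LSTM forward pass (Eq. 12): candidate update and the
    input, forget, and output gates. params maps 'z', 'i', 'f', 'o' to
    (W1, W2, b) triples; this layout is an illustrative choice."""
    W1z, W2z, bz = params["z"]
    W1i, W2i, bi = params["i"]
    W1f, W2f, bf = params["f"]
    W1o, W2o, bo = params["o"]
    z = np.tanh(W1z @ x_t + W2z @ h_prev + bz)    # candidate cell update
    i = sigmoid(W1i @ x_t + W2i @ h_prev + bi)    # input gate
    f = sigmoid(W1f @ x_t + W2f @ h_prev + bf)    # forget gate
    o = sigmoid(W1o @ x_t + W2o @ h_prev + bo)    # output gate
    return z, i, f, o

# Toy usage with 3-dimensional inputs and 4 hidden units
rng = np.random.default_rng(2)
d_in, d_h = 3, 4
params = {k: (rng.normal(size=(d_h, d_in)),
              rng.normal(size=(d_h, d_h)),
              np.zeros(d_h))
          for k in "zifo"}
z, i, f, o = lstm_gates(rng.normal(size=d_in), np.zeros(d_h), params)
print(z.shape, i.shape, f.shape, o.shape)        # -> (4,) (4,) (4,) (4,)
```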