the same structure and weights. A simple example of the process can be written as

      $$\boldsymbol{h}^{(t)} = \sigma\left(\boldsymbol{W}_1 \boldsymbol{x}^{(t)} + \boldsymbol{W}_2 \boldsymbol{h}^{(t-1)} + \boldsymbol{b}\right) \tag{9}$$

      where $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$ are weight matrices of the network $N$, $\sigma(\cdot)$ is an activation function, and $\boldsymbol{b}$ is the bias vector. Depending on the task, the loss function is evaluated, and the gradient is backpropagated through the network to update its weights. For a classification task, the final output $\boldsymbol{h}^{(T)}$ can be passed into another network to make a prediction. For a sequence-to-sequence model, $\hat{\boldsymbol{y}}^{(t)}$ can be generated based on $\boldsymbol{h}^{(t)}$ and then compared with $\boldsymbol{y}^{(t)}$.
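      To make the recurrence in Equation (9) concrete, the following is a minimal NumPy sketch of the forward pass, assuming a tanh activation and a zero initial hidden state; the function name rnn_forward and the toy dimensions are illustrative and not taken from the text.

```python
import numpy as np

def rnn_forward(x_seq, W1, W2, b, sigma=np.tanh):
    """Run the recurrence h^(t) = sigma(W1 x^(t) + W2 h^(t-1) + b) over a sequence.

    x_seq : array of shape (T, d_in), one input vector per time step
    W1    : (d_h, d_in) input-to-hidden weight matrix
    W2    : (d_h, d_h)  hidden-to-hidden weight matrix
    b     : (d_h,)      bias vector
    Returns the list of hidden states h^(1), ..., h^(T).
    """
    h = np.zeros(W2.shape[0])          # h^(0): zero initial hidden state (an assumption)
    states = []
    for x_t in x_seq:                  # the same weights are reused at every time step
        h = sigma(W1 @ x_t + W2 @ h + b)
        states.append(h)
    return states

# Toy usage: T = 5 time steps, 3-dimensional inputs, 4 hidden units
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
h_seq = rnn_forward(rng.normal(size=(T, d_in)),
                    0.1 * rng.normal(size=(d_h, d_in)),
                    0.1 * rng.normal(size=(d_h, d_h)),
                    np.zeros(d_h))
print(len(h_seq), h_seq[-1].shape)     # -> 5 (4,)
```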

      However, a drawback of RNNs is that they have difficulty "remembering" remote information. In an RNN, long-term memory is reflected in the weights of the network, which memorize remote information via shared weights. Short-term memory is in the form of information flow, where the output from the previous state is passed into the current state. However, when the sequence length $T$ is large, the optimization of an RNN suffers from the vanishing gradient problem. For example, if the loss $\ell^{(T)}$ is evaluated at $t = T$, the gradient with respect to $\boldsymbol{W}_1$ calculated via backpropagation can be written as

      $$\frac{\partial \ell^{(T)}}{\partial \boldsymbol{W}_1} = \sum_{t=0}^{T} \frac{\partial \ell^{(T)}}{\partial \boldsymbol{h}^{(T)}} \left( \prod_{j=t+1}^{T} \frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}} \right) \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{W}_1} \tag{10}$$

      where the product $\prod_{j=t+1}^{T} \frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}}$ is the reason for the vanishing gradient. In RNNs, the tanh function is commonly used as the activation function, so

      $$\boldsymbol{h}^{(j)} = \tanh\left(\boldsymbol{W}_1 \boldsymbol{x}^{(j)} + \boldsymbol{W}_2 \boldsymbol{h}^{(j-1)} + \boldsymbol{b}\right) \tag{11}$$
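      To see numerically why the product in Equation (10) drives the gradient toward zero, note that Equation (11) and the chain rule give $\frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}} = \mathrm{diag}\left(1 - (\boldsymbol{h}^{(j)})^2\right) \boldsymbol{W}_2$, whose entries are damped by the tanh derivative. The sketch below, for an arbitrary scalar-input RNN with randomly drawn weights, tracks the norm of the accumulated Jacobian product; the particular dimensions and weight scales are illustrative assumptions.

```python
import numpy as np

# Toy RNN with scalar inputs; weight scales are arbitrary illustrative choices
rng = np.random.default_rng(1)
d_h, T = 4, 50
W1 = 0.5 * rng.normal(size=(d_h, 1))
W2 = 0.3 * rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

h = np.zeros(d_h)
J = np.eye(d_h)                        # running product of Jacobians d h^(j) / d h^(j-1)
norms = []
for t in range(T):
    x_t = rng.normal(size=1)
    h = np.tanh(W1 @ x_t + W2 @ h + b)
    # For the tanh recurrence in Eq. (11): d h^(j) / d h^(j-1) = diag(1 - h^(j)**2) @ W2
    J = np.diag(1.0 - h**2) @ W2 @ J
    norms.append(np.linalg.norm(J))

# The norm of the accumulated product typically decays toward zero as T grows
print(norms[0], norms[T // 2], norms[-1])
```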

      6.3 Long Short‐Term Memory Networks

      Since the original LSTM model was introduced, many variants have been proposed. The forget gate was introduced by Gers et al. [20]; it has proven effective and is standard in most LSTM architectures. The forwarding process of an LSTM with a forget gate can be divided into two steps. In the first step, the following values are calculated:

      $$\begin{aligned}
      \boldsymbol{z}^{(t)} &= \tanh\left(\boldsymbol{W}_{1z} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2z} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_z\right) \\
      \boldsymbol{i}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1i} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2i} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_i\right) \\
      \boldsymbol{f}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1f} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2f} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_f\right) \\
      \boldsymbol{o}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1o} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2o} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_o\right)
      \end{aligned} \tag{12}$$
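      As a rough sketch of this first step, the helper below computes the candidate update $\boldsymbol{z}^{(t)}$ and the gates $\boldsymbol{i}^{(t)}$, $\boldsymbol{f}^{(t)}$, $\boldsymbol{o}^{(t)}$ of Equation (12), assuming $\sigma_g$ denotes the logistic sigmoid; the function name lstm_gates and the parameter layout are illustrative, and the second step of the forward pass (the cell-state and hidden-state updates) is not shown.

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid, assumed here to be the gate activation sigma_g
    return 1.0 / (1.0 + np.exp(-a))

def lstm_gates(x_t, h_prev, params):
    """First step of the LSTM forward pass (Eq. 12): candidate update and the
    input, forget, and output gates. params maps 'z', 'i', 'f', 'o' to
    (W1, W2, b) triples; this layout is an illustrative choice."""
    W1z, W2z, bz = params["z"]
    W1i, W2i, bi = params["i"]
    W1f, W2f, bf = params["f"]
    W1o, W2o, bo = params["o"]
    z = np.tanh(W1z @ x_t + W2z @ h_prev + bz)    # candidate cell update
    i = sigmoid(W1i @ x_t + W2i @ h_prev + bi)    # input gate
    f = sigmoid(W1f @ x_t + W2f @ h_prev + bf)    # forget gate
    o = sigmoid(W1o @ x_t + W2o @ h_prev + bo)    # output gate
    return z, i, f, o

# Toy usage with 3-dimensional inputs and 4 hidden units
rng = np.random.default_rng(2)
d_in, d_h = 3, 4
params = {k: (rng.normal(size=(d_h, d_in)),
              rng.normal(size=(d_h, d_h)),
              np.zeros(d_h))
          for k in "zifo"}
z, i, f, o = lstm_gates(rng.normal(size=d_in), np.zeros(d_h), params)
print(z.shape, i.shape, f.shape, o.shape)        # -> (4,) (4,) (4,) (4,)
```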