href="#litres_trial_promo">Chapters 6 and 8, we will then see how more recent work focuses on integrating key ideas from these pre-neural approaches back into the novel deep learning paradigm.
1 SWKP (http://simple.wikipedia.org) is a corpus of simple texts targeting children and adults who are learning English and whose authors are requested to use easy words and short sentences.
2 http://en.wikipedia.org
3 There are some exceptions. For example, the task of title generation [Filippova and Altun, 2013b] or sentence summarisation [Rush et al., 2015] can be treated as abstractive sentence compression.
CHAPTER 3
Deep Learning Frameworks
In recent years, deep learning, also called the neural approach, has been proposed for text production. The pre-neural approach generally relied on a pipeline of modules, each performing a specific subtask. The neural approach differs in that it provides a uniform (end-to-end) framework for text production. First, the input is projected onto a continuous representation (representation learning); then, the generation step produces an output text from that input representation. Figure 3.1 illustrates this high-level framework used by neural approaches to text production.
One of the main strengths of neural networks is that they provide a powerful tool for representation learning. Representation learning often happens in a continuous space, such that different modalities of input, e.g., text (words, sentences, and even paragraphs), graphs, and tables, are represented by dense vectors. For instance, given the user input “I am good. How about you? What do you do for a living?” in a dialogue setting, a neural network will first be used to create a representation of the user input. Then, in a second step (the generation step), this representation will be used as the input to a decoder which will generate the system response, “Ah, boring nine to five office job. Pays for the house I live in”, a text conditioned on that input representation. Representation learning aims at encoding the information from the input that is necessary to generate the output text. Neural networks have proven to be effective at representation learning without requiring explicit features to be extracted directly from the data. These networks operate as complex functions that propagate values (linear transformations of input values) through non-linear functions (such as the sigmoid or the hyperbolic tangent function) to produce outputs that can in turn be propagated in the same way to upper layers of the network.
This chapter introduces current methods in deep learning that are common in natural language generation. The goal is to give a basic introduction to neural networks in Section 3.1, and to discuss the basic encoder-decoder approach [Cho et al., 2014, Sutskever et al., 2014], which has been the basis for much of the work on neural text production.
3.1 BASICS
Central to deep learning is its ability to do representation learning by introducing representations that are expressed in terms of other simpler representations [Goodfellow et al., 2016].1 Typically, neural networks are organised in layers; each layer consists of a number of interconnected nodes; each node takes inputs from the previous layer and applies a linear transformation followed by a nonlinear activation function. The network takes an input through the input layer, which communicates with one or more hidden layers, and finally produces model predictions through the output layer. Figure 3.2 illustrates an example deep learning system. A key characteristic of deep learning systems is that they can learn complex concepts from simpler concepts through the use of nonlinear activation functions; the activation function of a node in a neural network defines the output of that node given an input or set of inputs. Several hidden layers are often stacked to learn increasingly complex and abstract concepts, leading to a deep network.
Figure 3.1: Deep learning for text generation.
Figure 3.2: Feed-forward neural network or multi-layer perceptron.
What we just explained here is essentially a feed-forward neural network or multi-layer perceptron. It learns a function mapping a set of input values from the input layer to a set of output values in the output layer. The function is formed by composing many linear functions through nonlinear activations.2 Such networks are called feed-forward because information always flows forward from the input layer to the output layer through the hidden layers in between. There is no autoregressive connection in which the outputs of a layer are fed back to itself. Neural networks with autoregressive connections are often called recurrent neural networks (RNNs); they are widely explored for text production. We will discuss them later in this section.
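The forward computation of such a feed-forward network can be sketched in a few lines of Python. This is a minimal illustration and not code from the source: the function names init_layers and forward, the layer sizes, the random initialisation, and the use of tanh at every layer (including the output layer) are all assumptions made for the example.

```python
import math
import random


def init_layers(sizes, seed=0):
    """Create one (weights, biases) pair per layer.

    `sizes` lists the number of nodes per layer, e.g. [4, 8, 8, 3] means
    4 inputs, two hidden layers of 8 nodes, and 3 outputs (hypothetical sizes).
    """
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(x, layers, activation=math.tanh):
    """Propagate x from the input layer to the output layer.

    Each node computes a linear transformation of the previous layer's
    outputs and passes it through a nonlinear activation function.
    """
    h = x
    for weights, biases in layers:
        h = [activation(sum(w * v for w, v in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
    return h


layers = init_layers([4, 8, 8, 3])
y = forward([0.5, -1.0, 0.25, 2.0], layers)
print(len(y))  # one value per output node
```

Information only moves forward through the list of layers here; nothing computed by a layer is fed back into it, which is what distinguishes this sketch from a recurrent network.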
3.1.1 CONVOLUTIONAL NEURAL NETWORKS
Another type of neural network, the convolutional neural network, or CNN [Lecun, 1989], is specialised for processing data that has a known grid-like topology. These networks have proved successful in processing image data, which can be represented as 2-dimensional grids of image pixels [Krizhevsky et al., 2012, Xu et al., 2015a], and time-series data from automatic speech recognition problems [Abdel-Hamid et al., 2014, Zhang et al., 2017]. In recent years, CNNs have also been applied to natural language. In particular, they have been used to effectively learn word representations for language modelling [Kim et al., 2016] and sentence representations for sentence classification [Collobert et al., 2011, Kalchbrenner et al., 2014, Kim, 2014, Zhang et al., 2015] and summarisation [Cheng and Lapata, 2016, Denil et al., 2014, Narayan et al., 2017, 2018a,c]. CNNs employ a specialised kind of linear operation called convolution, followed by a pooling operation, to build a representation that is aware of spatial interactions among input data points. Figure 3.3 from Narayan et al. [2018a] shows how CNNs can be used to learn a sentence representation. First of all, CNNs require input to be in a grid-like structure. For example, a sentence $s$ of length $k$ can be represented as a dense matrix $W = [w_1 \oplus w_2 \oplus \ldots \oplus w_k] \in \mathbb{R}^{k \times d}$ where $w_i \in \mathbb{R}^{d}$ is a continuous representation of the $i$th word in $s$ and $\oplus$ is the concatenation operator. We apply a one-dimensional convolutional filter $K \in \mathbb{R}^{h \times d}$ of width $h$ to a window of $h$ words in $s$ to produce a new feature.3 This filter is applied to each possible window of words in $s$ to produce a feature map $f = [f_1, f_2, \ldots, f_{k-h+1}] \in \mathbb{R}^{k-h+1}$, where $f_i$ is defined over the window $W_{i:i+h-1}$ of rows $i$ to $i+h-1$ of $W$ as:
$$f_i = \mathrm{ReLU}(K \circ W_{i:i+h-1} + b)$$
where $\circ$ is the Hadamard product, followed by a sum over all elements, ReLU is a rectified linear activation and $b \in \mathbb{R}$ is a bias term. The ReLU activation function is often used because networks using it are easier to train and often achieve better performance than those using sigmoid or tanh functions [Krizhevsky et al., 2012]. Max-pooling over time [Collobert et al., 2011] is applied over the feature map $f$ to get $f_{\max} = \max(f)$ as the feature corresponding to this particular filter $K$. Multiple filters $K_h$ of width $h$ are often used to compute a list of features $f^{K_h}$. In addition, filters of varying widths are applied to learn a set of feature lists
Figure 3.3: Convolutional neural network for sentence encoding.
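The convolution and max-pooling-over-time steps described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation of Narayan et al. [2018a]: the function names feature_map and max_over_time, the toy word vectors in W (k = 4 words, d = 2 dimensions), and the filter values in K (width h = 2) are all invented for the example.

```python
def relu(z):
    """Rectified linear activation."""
    return max(0.0, z)


def feature_map(W, K, b=0.0):
    """Slide a filter K (h x d) over every window of h rows of W (k x d).

    For each window, take the elementwise (Hadamard) product with K,
    sum all elements, add the bias b, and apply ReLU.
    Returns a list of k - h + 1 features.
    """
    k, h, d = len(W), len(K), len(K[0])
    if h > k:
        raise ValueError("filter is wider than the sentence")
    features = []
    for i in range(k - h + 1):
        total = sum(K[r][c] * W[i + r][c] for r in range(h) for c in range(d))
        features.append(relu(total + b))
    return features


def max_over_time(f):
    """Max-pooling over time: keep the largest feature in the map."""
    return max(f)


# Toy sentence of k = 4 words with d = 2 dimensional embeddings (made up).
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [-1.0, 0.0]]

# One filter of width h = 2 (made-up values).
K = [[1.0, -1.0],
     [0.5, 0.5]]

f = feature_map(W, K, b=0.0)
f_max = max_over_time(f)
print(f)      # [1.5, 0.0, 0.0]
print(f_max)  # 1.5
```

With k = 4 and h = 2 the feature map has k - h + 1 = 3 entries, one per window. Repeating the same computation with several filters, and with filters of different widths, would yield one pooled feature per filter.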
We describe in Chapter 5 how such convolutional sentence encoders can be useful for better