
Recurrent neural networks

Elman RNN

A recurrent neural network (RNN) is designed to operate on sequences. The simplest RNN is the so-called Elman (or vanilla) RNN, whose operation is illustrated in the following diagram:

An Elman RNN with input node, intermediate node containing the hidden state and output node.

Figure 1. Architecture of an Elman (or vanilla) RNN.

The network consists of a simple feedforward network that we apply to an input sequence, one item at a time. The network has a single hidden layer, which receives the current input together with the hidden state from the previous time step - this is the recurrence. The time delay on the recurrent input is signified by the brown square on the corresponding link.

The computation at time \(t\) of an Elman RNN is given by:

\[ \myvec{h}_t = \tanh(\myvec{b} + \mymat{W}\myvec{h}_{t-1} + \mymat{U}\myvec{x}_t) \]
\[ \myvec{o}_t = \myvec{c} + \mymat{V}\myvec{h}_t \]

where \(\mymat{U}\) (input-to-hidden), \(\mymat{W}\) (hidden-to-hidden) and \(\mymat{V}\) (hidden-to-output) are weight matrices and \(\myvec{b},\myvec{c}\) are bias vectors.

It is sometimes convenient to rewrite \(\mymat{W}\myvec{h}_{t-1} + \mymat{U}\myvec{x}_t\) as \([\mymat{W},\mymat{U}][\myvec{h}_{t-1};\myvec{x}_t]\), where we simply concatenate \(\mymat{W}\) and \(\mymat{U}\) side by side, and stack \(\myvec{h}_{t-1}\) above \(\myvec{x}_t\).
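To make this concrete, here is a minimal NumPy sketch of a single Elman step (the layer sizes are illustrative, not from the text); it also verifies the concatenation identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes for input, hidden state and output.
n_in, n_hid, n_out = 4, 8, 3
U = rng.standard_normal((n_hid, n_in))   # input-to-hidden weights
W = rng.standard_normal((n_hid, n_hid))  # hidden-to-hidden weights
V = rng.standard_normal((n_out, n_hid))  # hidden-to-output weights
b, c = np.zeros(n_hid), np.zeros(n_out)  # bias vectors

x_t = rng.standard_normal(n_in)          # current input
h_prev = rng.standard_normal(n_hid)      # previous hidden state

# h_t = tanh(b + W h_{t-1} + U x_t);  o_t = c + V h_t
h_t = np.tanh(b + W @ h_prev + U @ x_t)
o_t = c + V @ h_t

# The concatenated form [W, U][h_{t-1}; x_t] gives the same pre-activation.
WU = np.hstack([W, U])
hx = np.concatenate([h_prev, x_t])
assert np.allclose(W @ h_prev + U @ x_t, WU @ hx)
```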

To build a computational graph from an RNN, we unroll the network over a fixed number of time steps (\(T\), say), as illustrated below:

The RNN is repeated once for each timestep. The hidden state is shown being passed forward in time to the following timestep. A zero vector is introduced in its place at the first timestep.

Figure 2. Unrolled RNN over multiple timesteps

Note that the parameters of the RNN are the same across all time steps. Since there is no hidden state vector from a previous time step at the first time step, it is normally taken to be the zero vector.
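In code, unrolling amounts to a loop that reuses the same parameters at every step, starting from a zero hidden state - a minimal sketch, again with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 4, 8, 5

U = rng.standard_normal((n_hid, n_in))   # shared across all time steps
W = rng.standard_normal((n_hid, n_hid))
b = np.zeros(n_hid)

xs = rng.standard_normal((T, n_in))      # an input sequence of length T
h = np.zeros(n_hid)                      # zero vector stands in for the missing h_0
hs = []
for x_t in xs:
    h = np.tanh(b + W @ h + U @ x_t)     # same U, W, b at every timestep
    hs.append(h)
```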

We create a deep RNN by simply stacking hidden layers on top of one another in the usual way. The operation of an RNN with two hidden layers is shown below:

A deep RNN with two hidden layers

Figure 3. A deep RNN having two intermediate layers with associated hidden states

Unrolling this over a fixed number of timesteps is done by analogy with the single-layer RNN.
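A minimal sketch of the two-layer case, assuming each layer keeps its own hidden state and the second layer takes the first layer's hidden state as its input:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, T = 4, 8, 5

# One (U, W, b) triple per layer; layer 2's input is layer 1's hidden state.
layers = [
    (rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_hid)), np.zeros(n_hid)),
    (rng.standard_normal((n_hid, n_hid)), rng.standard_normal((n_hid, n_hid)), np.zeros(n_hid)),
]

xs = rng.standard_normal((T, n_in))
hs = [np.zeros(n_hid) for _ in layers]   # zero initial state in each layer

for x_t in xs:
    inp = x_t
    for l, (U, W, b) in enumerate(layers):
        hs[l] = np.tanh(b + W @ hs[l] + U @ inp)
        inp = hs[l]                      # pass this layer's state upwards
```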

Building a text-clip generator

Our aim is to produce a generative model of language in the form of an RNN trained on a corpus of text-clips (short pieces of text) whose style we wish to capture. The RNN processes one character at a time and, when rolled out over several time steps, looks like this:

A sequence of characters is input to an unrolled two-layer RNN, preceded by a special <start> token. The desired output is the same sequence advanced by one timestep, with a special <end> token appended. Note that the <start> token doesn't appear on the output.

Figure 4. Using an RNN as a character generator

The RNN in this example has two hidden layers and accepts one of 28 possible tokens as input. The possible tokens correspond to the 26 letters of the English alphabet (A-Z) and the special tokens <start> and <end>, which denote the beginning of the input and the end of the text-clip on output. Input tokens are represented as one-hot vectors.

Assuming our RNN network has been trained, we generate a novel character sequence as follows. At the first timestep we input the <start> token to the RNN. The output is a probability distribution over tokens. We sample from this distribution and the sample becomes the first output token - in this case it is an 'h'. The process is repeated at subsequent timesteps, each time sampling from the distribution and transferring the chosen token to become the next input. This continues until either the <end> token is chosen for output or a fixed limit on the number of timesteps is reached. In summary:

1. Input <start> and compute the output distribution (using the RNN with softmax).
2. Sample from the probability distribution.
3. Set the next input to be this sampled token.
4. Repeat until <end> is chosen or a fixed number of iterations is reached.
Instead of sampling from the distribution, we could simply take the token with the highest probability - the so-called greedy strategy. Of course, this would generate the same sequence every time and is therefore not particularly useful.
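The sampling loop can be sketched as follows, where rnn_step is a hypothetical stand-in for one step of the trained network (returning output logits and the updated hidden state):

```python
import numpy as np

rng = np.random.default_rng()
tokens = list("abcdefghijklmnopqrstuvwxyz") + ["<start>", "<end>"]  # 28 tokens

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(rnn_step, h0, max_len=100):
    """rnn_step(token_id, h) -> (logits, h) is a placeholder for the trained RNN."""
    h, out = h0, []
    tok = tokens.index("<start>")
    for _ in range(max_len):                              # fixed limit on timesteps
        logits, h = rnn_step(tok, h)
        tok = rng.choice(len(tokens), p=softmax(logits))  # sample, rather than argmax
        if tokens[tok] == "<end>":
            break
        out.append(tokens[tok])          # the chosen token is emitted and becomes
    return "".join(out)                  # the next input on the following iteration
```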

To train the network, we sample a large number of fixed-length character sequences from our language corpus (e.g. 16 characters in length). Each example contributes to a batchwise loss as illustrated below (here assuming that 'hello' appears in the language corpus). For each input sequence, the idea is to target an identical sequence on output, advanced by one timestep as required for prediction. From the sequence of output distributions and the corresponding target sequence, we compute a log-likelihood for the target character at each timestep and sum these to produce a log-likelihood for the entire sequence. The overall loss is then the per-example negative log-likelihood summed over a batch of examples.

A similar diagram to Figure 4 except that the input tokens (characters, <start> and <end>) are represented by a one-hot encoding, and the output at each timestep is now a probability distribution over tokens.

Figure 5. Training the RNN text generator

Notice that, unlike at test time when we are generating a novel sequence, we don't sample from the output distributions; instead we look up the probabilities associated with the target characters in order to compute the likelihoods needed for the cross-entropy loss.
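As a sketch, assuming the per-timestep softmax outputs for one example are available as a (T, V) array (V being the number of tokens), the per-example loss is a lookup of the target probabilities followed by a sum of negative logs:

```python
import numpy as np

def sequence_nll(output_probs, target_ids):
    """Negative log-likelihood of one example.

    output_probs: array of shape (T, V), one distribution per timestep.
    target_ids:   the T indices of the target characters.
    """
    picked = output_probs[np.arange(len(target_ids)), target_ids]  # target probabilities
    return -np.sum(np.log(picked))

# The batch loss sums the per-example NLL over a batch of examples:
# loss = sum(sequence_nll(probs, targets) for probs, targets in batch)
```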

Building a text generator

A variation on the above approach is to do away with the special <start> and <end> tokens and simply use fixed-length character strings in training. The character strings are sampled from a large corpus of text, such as the Complete Works of Shakespeare or textual data accessed from the Web. For each example, we drop the last character to produce an input sequence, and drop the first character to produce an output sequence. Thus, from the character string 'The cat sat on the mat' sampled from our corpus, the input becomes 'The cat sat on the ma' and the output becomes 'he cat sat on the mat'. In testing, instead of using a <start> token to make the first prediction, we input a short piece of seed text and get the generator to extend it for as long as we wish.
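The input/target construction is then just a pair of one-character shifts:

```python
clip = "The cat sat on the mat"   # a string sampled from the corpus
inp = clip[:-1]                   # "The cat sat on the ma"  (drop the last character)
out = clip[1:]                    # "he cat sat on the mat"  (drop the first character)
```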

To illustrate, the following generated text comes from an RNN trained on the text of the book Alice's Adventures in Wonderland by Lewis Carroll. The seed text that starts the generator is shown in red. Although the style is vaguely reminiscent of the book, the generated text is entirely novel.

The queen looked at alice and she said to herself and she said to herself in a mourefully as she could not and sometimes to say it at all said the caterpillar. well to her like the while and the dormouse she said to herself in a mouse them the mock turtle said and so she said the mock turtle to say what i should like to the hatter with a sort of catter that the dormouse was a little began with a streak the dormo

RNNs with long-term memory

Consider the following partial sentence:

I grew up in France ... I speak fluent French

Generating this sentence with an RNN is challenging because predicting the final word 'French' requires information from earlier in the sentence. In principle an RNN can handle this, since the occurrence of the word 'France' should be encoded into the hidden state vector when it occurs and carried through to the point at which the word 'French' must be produced. In practice this doesn't work well, since the gradients computed by backpropagation get progressively smaller the further back in time one goes - the vanishing gradient problem. This undermines the parameter adaptation needed to retain a trace of the word 'France' having occurred. A solution is to introduce a more explicit kind of persistent memory into our recurrent network. In the following sections, we examine the two most common RNN architectures for doing this.

Long short-term memory

In a long short-term memory (LSTM) network, the simple Elman RNN cell is replaced by one that looks like this:

The cell architecture of an LSTM, which implements the set of equations below. The diagram highlights the idea that the LSTM has a 'cell' state which comes in from the previous timestep and goes out to the next, being attenuated on the way and with a new value added in that depends on the current input and previous hidden state.

Figure 6. LSTM cell architecture

\[ \begin{align} \myvec{f}_t &= \sigma (\mymat{W}_f \myvec{x}_t + \mymat{U}_f \myvec{h}_{t-1} + \myvec{b}_f ) \\ \myvec{i}_t &= \sigma (\mymat{W}_i \myvec{x}_t + \mymat{U}_i \myvec{h}_{t-1} + \myvec{b}_i ) \\ \myvec{o}_t &= \sigma (\mymat{W}_o \myvec{x}_t + \mymat{U}_o \myvec{h}_{t-1} + \myvec{b}_o ) \\ \\ \myvec{\tilde{c}}_t &= \tanh (\mymat{W}_c \myvec{x}_t + \mymat{U}_c \myvec{h}_{t-1} + \myvec{b}_c ) \\ \myvec{c}_t &= \myvec{f}_t \circ \myvec{c}_{t-1} + \myvec{i}_t \circ \myvec{\tilde{c}}_t \\ \myvec{h}_t &= \myvec{o}_t \circ \tanh(\myvec{c}_t) \end{align}\]

The symbol \(\circ\) means the Hadamard product between two vectors, which is simply the element-wise product. Thus \((0,0.5,1.0) \circ (3.0,4.0,2.0) = (0,2.0,2.0)\).

To understand how the LSTM works, let's look at the equations in turn.

The first three equations are affine functions of the input and previous hidden state vector - in other words, they are fully connected linear layers. The logistic function \(\sigma\) is applied to each, giving output vectors \(\myvec{f}_t\), \(\myvec{i}_t\) and \(\myvec{o}_t\) with components in the interval \((0,1)\). These vectors will be used as 'gates' to control the flow of information through the LSTM.

The vector \(\myvec{c}\) is the ongoing memory of the LSTM - the cell state. The previous memory \(\myvec{c}_{t-1}\) enters the cell and is first attenuated (gated) by \(\myvec{f}_t\) (i.e. the components of \(\myvec{c}_{t-1}\) are multiplied elementwise by the components of \(\myvec{f}_t\)). A new memory is added in and the result leaves the cell as \(\myvec{c}_t\). The memory addition is based on \(\myvec{\tilde{c}}_t\), which is itself an affine function of the input and previous hidden state followed by the hyperbolic tangent (\(\tanh\)). Before being added in, this new memory is attenuated by \(\myvec{i}_t\).

Finally, the output hidden state \(\myvec{h}_t\) is obtained by passing the new cell state \(\myvec{c}_t\) through \(\tanh\) and attenuating the result elementwise by \(\myvec{o}_t\).

Just as for the vanilla RNN, a linear output layer at each time step can be added as a function of \(\myvec{h}_t\) (not shown).
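The six equations translate directly into code. Here is a minimal NumPy sketch of one LSTM step, with the weights and biases passed in as a dictionary (a convention chosen here purely for brevity):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p holds the matrices W_*, U_* and biases b_*."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))            # logistic function
    f = sigma(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    i = sigma(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    o = sigma(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # new memory
    c = f * c_prev + i * c_tilde   # gate the old memory, add in the attenuated new one
    h = o * np.tanh(c)             # expose a gated view of the cell state
    return h, c
```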

The LSTM cell within a neural network is normally represented as follows:

Similar to the Elman RNN diagram except that the hidden state is replaced by an LSTM cell.

Figure 7. LSTM architecture diagram

Gated recurrent unit

The Gated Recurrent Unit (GRU) does away with the cell state of an LSTM, and simply uses the hidden state \(\myvec{h}_t\) for the persistent memory. The architecture of a GRU cell looks like this:

Simpler than the LSTM cell architecture: there is no cell state and everything hinges on the hidden state, which is shown passing through the cell just as the cell state does in the LSTM. As in the LSTM, there is additional machinery to attenuate and add in a new value that depends on the current input and previous hidden state.

Figure 8. GRU cell architecture

\[ \begin{align} \myvec{z}_t &= \sigma (\mymat{W}_z\myvec{x}_t + \mymat{U}_z \myvec{h}_{t-1} + \myvec{b}_z ) \\ \myvec{r}_t &= \sigma (\mymat{W}_r\myvec{x}_t + \mymat{U}_r \myvec{h}_{t-1} + \myvec{b}_r ) \\ \\ \hat{\myvec{h}}_t &= \tanh(\mymat{W}_h\myvec{x}_t + \mymat{U}_h (\myvec{r}_t \circ \myvec{h}_{t-1}) + \myvec{b}_h) \\ \myvec{h}_t &= (1-\myvec{z}_t)\circ\myvec{h}_{t-1} + \myvec{z}_t\circ\hat{\myvec{h}}_t \end{align} \]

The first two equations are fully connected linear layers taking the input and previous hidden state vectors. The logistic function \(\sigma\) is applied to each, giving output vectors \(\myvec{z}_t\) and \(\myvec{r}_t\) with components in the interval \((0,1)\). These vectors will be used as 'gates' to control the flow of information through the GRU.

The third equation produces a prospective value \(\hat{\myvec{h}}_t\) for the new hidden state vector as an affine function of the input and previous hidden state vector followed by \(\tanh\), except that the previous hidden state vector is first gated by \(\myvec{r}_t\). This gating effectively determines the extent to which the previous hidden state influences the prospective value for the current hidden state.

Finally, the fourth equation computes a final hidden state \(\myvec{h}_t\) by blending together the previous hidden state and the prospective hidden state computed in the third equation. The proportions of either constituent depend on \(\myvec{z}_t\). The operation labelled '1-' subtracts the input elementwise from 1.

The state can be seen as a persistent memory which can pass through a timestep unchanged if \(\myvec{z}_t = 0\) (by the fourth equation). If on the other hand \(\myvec{z}_t = 1\), the value is replaced by a new value, which could be entirely independent of the previous state (i.e. when \(\myvec{r}_t = 0\)).
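As with the LSTM, the GRU equations map directly onto a short NumPy sketch (the parameter dictionary is again just an illustrative convention):

```python
import numpy as np

def gru_step(x, h_prev, p):
    """One GRU step; p holds the matrices W_*, U_* and biases b_*."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))            # logistic function
    z = sigma(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigma(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_hat = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # prospective state
    return (1.0 - z) * h_prev + z * h_hat                 # blend old and new states
```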