Recurrent neural networks
Elman RNN
A recurrent neural network (RNN) is designed to operate on sequences. The simplest RNN is the so-called Elman (or vanilla) RNN, whose operation is illustrated in the following diagram:
Figure 1. Architecture of an Elman (or vanilla) RNN.
The network consists of a simple feedforward network that we apply to an input sequence, one item at a time. The network has a single hidden layer, which receives input both from the current input and from the hidden layer's own output at the previous time step - this is the recurrence. The time delay on the recurrent input is signified by the brown square on the corresponding link.
The computation at time \(t\) of an Elman RNN is given by:
\[
\myvec{h}_t = \tanh(\mymat{W}\myvec{h}_{t-1} + \mymat{U}\myvec{x}_t + \myvec{b})
\]
\[
\myvec{o}_t = \mymat{V}\myvec{h}_t + \myvec{c}
\]
where \(\mymat{U},\mymat{V},\mymat{W}\) are weight matrices and \(\myvec{b},\myvec{c}\) are bias vectors.
It is sometimes convenient to rewrite \(\mymat{W}\myvec{h}_{t-1} + \mymat{U}\myvec{x}_t\) as \([\mymat{W},\mymat{U}][\myvec{h}_{t-1};\myvec{x}_t]\), where \(\mymat{W}\) and \(\mymat{U}\) are concatenated side by side, and \(\myvec{h}_{t-1}\) is stacked on top of \(\myvec{x}_t\).
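For concreteness, here is a minimal NumPy sketch of a single Elman step; the dimensions, the random weights and the \(\tanh\) hidden nonlinearity are illustrative assumptions rather than values taken from the figures. The last lines check the concatenation identity numerically.

```python
# A single Elman RNN step in NumPy. The dimensions, random weights and the tanh
# hidden nonlinearity are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3

U = rng.normal(size=(n_hid, n_in))    # input-to-hidden weights
W = rng.normal(size=(n_hid, n_hid))   # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(n_out, n_hid))   # hidden-to-output weights
b = np.zeros(n_hid)                   # hidden bias
c = np.zeros(n_out)                   # output bias

x_t = rng.normal(size=n_in)           # current input
h_prev = np.zeros(n_hid)              # previous hidden state

h_t = np.tanh(W @ h_prev + U @ x_t + b)   # new hidden state
o_t = V @ h_t + c                         # output (pre-softmax)

# [W, U][h_{t-1}; x_t] gives the same pre-activation as W h_{t-1} + U x_t.
WU = np.concatenate([W, U], axis=1)       # W and U side by side
hx = np.concatenate([h_prev, x_t])        # h_{t-1} stacked on top of x_t
assert np.allclose(W @ h_prev + U @ x_t, WU @ hx)
```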
To build a computational graph from an RNN, we unroll the network over a fixed number of time steps (\(T\) in this case), as illustrated below:
Figure 2. Unrolled RNN over multiple timesteps
Note that the parameters of the RNN are shared across all time steps. Since there is no hidden state vector from a previous time step at the first time step, the initial hidden state is normally taken to be the zero vector.
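The unrolling can be sketched as a simple loop (NumPy, with illustrative dimensions and random weights); the same \(\mymat{U}\), \(\mymat{W}\) and \(\myvec{b}\) are reused at every time step and the initial hidden state is the zero vector.

```python
# Unrolling the Elman RNN over T time steps (NumPy sketch, illustrative dimensions).
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 4, 8

U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
b = np.zeros(n_hid)

xs = rng.normal(size=(T, n_in))   # an input sequence of length T
h = np.zeros(n_hid)               # no previous hidden state at the first time step

hs = []
for t in range(T):
    h = np.tanh(W @ h + U @ xs[t] + b)   # shared parameters at every time step
    hs.append(h)
hs = np.stack(hs)                 # hidden states for all T time steps
```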
We create a deep RNN by simply stacking hidden layers on top of one another in the usual way. The operation of an RNN with two hidden layers is shown below:
Figure 3. A deep RNN having two intermediate layers with associated hidden states
Unrolling this over a fixed number of time steps proceeds by analogy with the single-layer RNN.
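A sketch of the two-layer version follows, again with illustrative dimensions and random weights: at each time step the second hidden layer takes the first layer's hidden state as its input.

```python
# A two-layer (deep) Elman RNN step loop (NumPy sketch, illustrative dimensions).
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 4, 8

def layer_params(n_input):
    return (rng.normal(size=(n_hid, n_hid)),    # W: recurrent weights
            rng.normal(size=(n_hid, n_input)),  # U: input weights
            np.zeros(n_hid))                    # b: bias

W1, U1, b1 = layer_params(n_in)    # layer 1 is fed by the input x_t
W2, U2, b2 = layer_params(n_hid)   # layer 2 is fed by layer 1's hidden state

xs = rng.normal(size=(T, n_in))
h1 = np.zeros(n_hid)
h2 = np.zeros(n_hid)

for t in range(T):
    h1 = np.tanh(W1 @ h1 + U1 @ xs[t] + b1)   # first hidden layer
    h2 = np.tanh(W2 @ h2 + U2 @ h1 + b2)      # second hidden layer
```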
Building a text-clip generator
Our aim is to produce a generative model of language in the form of an RNN trained on a corpus of text-clips (short pieces of text) whose style we wish to capture. The RNN processes one character at a time and, when unrolled over several time steps, looks like this:
Figure 4. Using an RNN as a character generator
The RNN in this example has two hidden layers and accepts one of 28 possible tokens as input. The tokens correspond to the 26 letters of the English alphabet (A-Z) plus the special tokens <start> and <end>, which mark the beginning of a clip on input and its end on output. Input tokens are represented as one-hot vectors.
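A one-hot encoding of this 28-token vocabulary might look like the following sketch (lowercase letters are used here, matching the generated samples shown later).

```python
# One-hot encoding of the 28-token vocabulary: 26 letters plus <start> and <end>.
import numpy as np
import string

vocab = list(string.ascii_lowercase) + ["<start>", "<end>"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def one_hot(token):
    v = np.zeros(len(vocab))
    v[token_to_id[token]] = 1.0
    return v

print(one_hot("h"))   # a length-28 vector with a 1.0 in position 7
```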
Assuming our RNN has been trained, we generate a novel character sequence as follows. At the first time step we input the <start> token to the RNN. The output is a probability distribution over tokens. We sample from this distribution and the result becomes the first output token - in this case it is an 'h'. The process is repeated at subsequent time steps, each time sampling from the output distribution and feeding the chosen token back in as the next input. The process repeats until either the <end> token is chosen or a fixed number of characters has been generated. In summary:
1. Input <start> and compute the output distribution (using the RNN with a softmax output).
2. Sample from the probability distribution.
3. Set the next input to be this sampled token.
4. Repeat the compute-and-sample steps until <end> is chosen or a fixed number of iterations is reached (see the code sketch below).
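The generation loop can be sketched as follows. Here rnn_step stands in for a trained network (its weights are random and untrained, so the output is gibberish); the point is the sample-and-feed-back control flow.

```python
# Sketch of the sample-and-feed-back generation loop summarised above.
import numpy as np
import string

rng = np.random.default_rng(0)
vocab = list(string.ascii_lowercase) + ["<start>", "<end>"]
n_vocab, n_hid = len(vocab), 32

Wxh = rng.normal(scale=0.1, size=(n_hid, n_vocab))   # input-to-hidden weights
Whh = rng.normal(scale=0.1, size=(n_hid, n_hid))     # recurrent weights
Who = rng.normal(scale=0.1, size=(n_vocab, n_hid))   # hidden-to-output weights

def rnn_step(token_id, h):
    x = np.zeros(n_vocab)
    x[token_id] = 1.0                                 # one-hot input
    h = np.tanh(Whh @ h + Wxh @ x)
    logits = Who @ h
    p = np.exp(logits - logits.max())
    return p / p.sum(), h                             # softmax distribution over tokens

h = np.zeros(n_hid)
token = vocab.index("<start>")
generated = []
for _ in range(200):                                  # cap on the number of iterations
    p, h = rnn_step(token, h)
    token = rng.choice(n_vocab, p=p)                  # sample from the distribution
    if vocab[token] == "<end>":
        break
    generated.append(vocab[token])                    # sampled token becomes next input
print("".join(generated))
```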
To train the network, we sample a large number of fixed-length character sequences from our language corpus (e.g. 16 characters in length). Each example contributes to a batchwise loss as illustrated below (here assuming that 'hello' now appears in the language corpus). For each input sequence, the target is an identical sequence advanced by one time step, as required for prediction. From the sequence of output distributions and the corresponding target sequence, we compute a log-likelihood for the target character at each time step and sum these to give a log-likelihood for the entire sequence. The overall loss is then the per-example negative log-likelihood summed over a batch of examples.
Figure 5. Training the RNN text generator
Notice that unlike at test time, where we generate a novel sequence, we do not sample from the output distributions; instead we look up the probabilities assigned to the target characters in order to compute the likelihoods needed for the cross-entropy loss.
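The following sketch shows this loss computation: targets are the inputs advanced by one step, and the per-character negative log-likelihoods are summed over time steps and over the batch. The weights are untrained and the single-layer architecture is a toy stand-in for the network above.

```python
# Sketch of the batchwise negative log-likelihood loss described above.
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_hid, seq_len, batch_size = 28, 32, 16, 4

Wxh = rng.normal(scale=0.1, size=(n_hid, n_vocab))
Whh = rng.normal(scale=0.1, size=(n_hid, n_hid))
Who = rng.normal(scale=0.1, size=(n_vocab, n_hid))

batch = rng.integers(0, n_vocab, size=(batch_size, seq_len))   # token ids

total_loss = 0.0
for seq in batch:
    inputs, targets = seq[:-1], seq[1:]        # target = input advanced by one step
    h = np.zeros(n_hid)
    for x_id, y_id in zip(inputs, targets):
        x = np.zeros(n_vocab)
        x[x_id] = 1.0
        h = np.tanh(Whh @ h + Wxh @ x)
        logits = Who @ h
        log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
        total_loss += -log_probs[y_id]         # look up probability of the target token
print(total_loss)                              # negative log-likelihood over the batch
```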
Building a text generator
A variation on the above approach is to do away with the special <start> and <end> tokens and simply use fixed length character strings in training. The character strings are sampled from a large corpus of text such as the
Complete Works of Shakespeare, or textual data accessed from the Web. For each example, we drop the last character to produce an input sequence, and drop the first character to produce an output sequence. Thus, from the character
string The cat sat on the mat sampled from our corpus, the input becomes The cat sat on the ma, and the output becomes he cat sat on the mat. At test time, instead of using a <start> token to begin generation, we seed the generator with a short piece of initial text.
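In code, producing the input and output sequences from a sampled string is just a pair of slices:

```python
# Drop the last character for the input, drop the first for the target.
s = "The cat sat on the mat"
inp, target = s[:-1], s[1:]
print(inp)      # "The cat sat on the ma"
print(target)   # "he cat sat on the mat"
```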
To illustrate, the following generated text comes from an RNN trained on the text of the book Alice's Adventures in Wonderland by Lewis Carroll. The seed text that starts the generator is shown in red. Although the style is vaguely reminiscent of the book, the generated text is entirely novel.
The queen looked at alice and she said to herself and she said to herself in a mourefully as she could not and sometimes to say it at all said the caterpillar. well to her like the while and the dormouse she said to herself in a mouse them the mock turtle said and so she said the mock turtle to say what i should like to the hatter with a sort of catter that the dormouse was a little began with a streak the dormo
RNNs with long-term memory
Consider the following partial sentence:
I grew up in France ... I speak fluent French
Generating this sentence with an RNN is challenging because predicting the final word 'French' requires information from earlier in the sentence. In principle an RNN can handle this since the occurrence of the word 'France' should be encoded into the hidden state vector when it occurs, and carried through to the point at which the word 'French' must be produced. In practice, this doesn't work well since the gradients computed by backpropagation get progressively smaller the further back in time one goes. This undermines the parameter adaptation that we need to retain a trace of the word 'France' having occurred. A solution is to introduce a more explicit kind of persistent memory into our recurrent network. In the following sections, we examine the two most common RNN architectures for doing this.
Long short-term memory
In a Long Short-Term Memory (LSTM) network, the simple Elman RNN cell is replaced by one that looks like this:
Figure 6. LSTM cell architecture
The symbol \(\circ\) means the Hadamard product between two vectors, which is simply the element-wise product. Thus \((0,0.5,1.0) \circ (3.0,4.0,2.0) = (0,2.0,2.0)\).
To understand how the LSTM works, let's look at the equations in turn:
\[
\myvec{f}_t = \sigma(\mymat{W}_f\myvec{h}_{t-1} + \mymat{U}_f\myvec{x}_t + \myvec{b}_f)
\]
\[
\myvec{i}_t = \sigma(\mymat{W}_i\myvec{h}_{t-1} + \mymat{U}_i\myvec{x}_t + \myvec{b}_i)
\]
\[
\myvec{o}_t = \sigma(\mymat{W}_o\myvec{h}_{t-1} + \mymat{U}_o\myvec{x}_t + \myvec{b}_o)
\]
\[
\tilde{\myvec{c}}_t = \tanh(\mymat{W}_c\myvec{h}_{t-1} + \mymat{U}_c\myvec{x}_t + \myvec{b}_c)
\]
\[
\myvec{c}_t = \myvec{f}_t \circ \myvec{c}_{t-1} + \myvec{i}_t \circ \tilde{\myvec{c}}_t
\]
\[
\myvec{h}_t = \myvec{o}_t \circ \tanh(\myvec{c}_t)
\]
The first three equations are affine functions of the input and previous hidden state vector - in other words, fully connected linear layers. The logistic function \(\sigma\) is applied to each, giving output vectors \(\myvec{f}_t\), \(\myvec{i}_t\) and \(\myvec{o}_t\) with components in the interval \((0,1)\). These vectors are used as 'gates' to control the flow of information through the LSTM.
The vector \(\myvec{c}\) is the ongoing memory of the LSTM - the cell state. The previous memory \(\myvec{c}_{t-1}\) enters the cell and is first attenuated (gated) by \(\myvec{f}_t\) (i.e. the components of \(\myvec{c}_{t-1}\) are multiplied elementwise by the components of \(\myvec{f}_t\)). A new memory is then added in, and the result leaves the cell as \(\myvec{c}_t\). The memory addition is based on \(\tilde{\myvec{c}}_t\), which is itself an affine function of the input and previous hidden state followed by the hyperbolic tangent (\(\tanh\)). Before being added in, this new memory is attenuated by \(\myvec{i}_t\).
Finally, the output hidden state \(\myvec{h}_t\) is the new cell state \(\myvec{c}_t\), passed through \(\tanh\) and attenuated elementwise by \(\myvec{o}_t\).
Just as for the vanilla RNN, a linear output layer at each time step can be added as a function of \(\myvec{h}_t\) (not shown).
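A minimal NumPy sketch of a single LSTM step, following the standard formulation above (the dimensions and random untrained weights are illustrative):

```python
# One LSTM cell step: gates, candidate memory, cell-state update, hidden state.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One (W, U, b) triple per gate (f, i, o) and for the candidate memory (c).
params = {k: (rng.normal(size=(n_hid, n_hid)),
              rng.normal(size=(n_hid, n_in)),
              np.zeros(n_hid)) for k in "fioc"}

def lstm_step(x_t, h_prev, c_prev):
    def affine(k):
        W, U, b = params[k]
        return W @ h_prev + U @ x_t + b
    f_t = sigmoid(affine("f"))                 # forget gate
    i_t = sigmoid(affine("i"))                 # input gate
    o_t = sigmoid(affine("o"))                 # output gate
    c_tilde = np.tanh(affine("c"))             # candidate new memory
    c_t = f_t * c_prev + i_t * c_tilde         # gate old memory, add gated new memory
    h_t = o_t * np.tanh(c_t)                   # expose gated memory as the hidden state
    return h_t, c_t

h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c)
```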
The LSTM cell within a neural network is normally represented as follows:
Figure 7. LSTM architecture diagram
Gated recurrent unit
The Gated Recurrent Unit (GRU) does away with the cell state of an LSTM, and simply uses the hidden state \(\myvec{h}_t\) for the persistent memory. The architecture of a GRU cell looks like this:
Figure 8. GRU cell architecture
The computation of the GRU at time \(t\) is:
\[
\myvec{z}_t = \sigma(\mymat{W}_z\myvec{h}_{t-1} + \mymat{U}_z\myvec{x}_t + \myvec{b}_z)
\]
\[
\myvec{r}_t = \sigma(\mymat{W}_r\myvec{h}_{t-1} + \mymat{U}_r\myvec{x}_t + \myvec{b}_r)
\]
\[
\hat{\myvec{h}}_t = \tanh(\mymat{W}_h(\myvec{r}_t \circ \myvec{h}_{t-1}) + \mymat{U}_h\myvec{x}_t + \myvec{b}_h)
\]
\[
\myvec{h}_t = (1 - \myvec{z}_t) \circ \myvec{h}_{t-1} + \myvec{z}_t \circ \hat{\myvec{h}}_t
\]
The first two equations are fully connected linear layers taking the input and previous hidden state vectors. The logistic function \(\sigma\) is applied to each, giving output vectors \(\myvec{z}_t\) and \(\myvec{r}_t\) with components in the interval \((0,1)\). These vectors are used as 'gates' to control the flow of information through the GRU.
The third equation produces a prospective value \(\hat{\myvec{h}}_t\) for the new hidden state vector as an affine function of the input and previous hidden state vector, except that the previous hidden state vector is gated by \(\myvec{r}_t\). This gating effectively determines the extent to which the previous hidden state influences the prospective value for the current hidden state.
Finally, the fourth equation computes a final hidden state \(\myvec{h}_t\) by blending together the previous hidden state and the prospective hidden state computed in the third equation. The proportions of either constituent depend on \(\myvec{z}_t\). The operation labelled '1-' subtracts the input elementwise from 1.
The state can be seen as a persistent memory which can pass through a timestep unchanged if \(\myvec{z}_t = 0\) (by the fourth equation). If on the other hand \(\myvec{z}_t = 1\), the value is replaced by a new value, which could be entirely independent of the previous state (i.e. when \(\myvec{r}_t = 0\)).
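For completeness, here is the corresponding NumPy sketch of a single GRU step, again with illustrative dimensions and random untrained weights:

```python
# One GRU cell step: update gate, reset gate, prospective state, blended hidden state.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One (W, U, b) triple for each of the update gate (z), reset gate (r)
# and the prospective hidden state (h).
params = {k: (rng.normal(size=(n_hid, n_hid)),
              rng.normal(size=(n_hid, n_in)),
              np.zeros(n_hid)) for k in "zrh"}

def gru_step(x_t, h_prev):
    Wz, Uz, bz = params["z"]
    Wr, Ur, br = params["r"]
    Wh, Uh, bh = params["h"]
    z_t = sigmoid(Wz @ h_prev + Uz @ x_t + bz)            # update gate
    r_t = sigmoid(Wr @ h_prev + Ur @ x_t + br)            # reset gate
    h_hat = np.tanh(Wh @ (r_t * h_prev) + Uh @ x_t + bh)  # prospective hidden state
    return (1.0 - z_t) * h_prev + z_t * h_hat             # blend previous and prospective

h_t = gru_step(rng.normal(size=n_in), np.zeros(n_hid))
```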