Sequential Data
Many kinds of data are intrinsically sequential in nature. For example:
- Text is composed of a sequence of words.
- Videos are made up of sequences of images depicting successive snapshots of a scene.
- A patient's medical record is made up of time-stamped medical events such as symptoms (e.g. headache, fever), diagnostic procedures (e.g. having a blood test), diagnoses (e.g. Streptococcal infection) and prescriptions (e.g. an antibiotic).
There are many applications that involve sequential data of one form or another. We can characterise these applications according to whether the input, the output, or both are sequences.
In image captioning, the task is to take a single image as input and produce a textual description as output, as in the example below. The textual output can be regarded as a sequence of words or a sequence of characters.
In sentiment analysis, the task is to take a textual description as input and generate a sentiment value as output (e.g. positive, negative, five star, one star).
In a similar task, the target is a vector of five numerical personality scores, as in the following example from the IBM Personality Insights service (now discontinued).
A major task is machine translation from one language to another, for example from English to French, as in the following example.
Because the input and output are both sequences, this is sometimes referred to as a seq2seq task.
Another seq2seq task is text to speech, where the input is text and the output is a speech waveform (a sequence of audio intensity values).
A final example is a text generator with no external input, which produces sequences of characters conforming to some language domain. Such a domain could, for example, be the writings of Shakespeare; the language generator is required to produce text in the style of Shakespeare without reproducing verbatim extracts from his written work.
Stochastic processes
In abstract terms, we can think of many of these tasks as predicting the future given the past: predicting the next word in a sentence, the next rainfall map, or the occurrence of heart disease. We can represent this as a conditional probability distribution:
\[ p(\myvec{x}_t \mid \myvec{x}_1, \myvec{x}_2, \cdots , \myvec{x}_{t-1}) \]
Such distributions define a stochastic process. Given values for \(\myvec{x}_1, \myvec{x}_2, \cdots , \myvec{x}_{t-1}\), we can sample from the conditional distribution to generate different possible futures. Having sampled a specific value \(\myvec{x}_t\) for time \(t\), we can repeat the process going forward and sample \(\myvec{x}_{t+1}\) from the conditional distribution:
\[ p(\myvec{x}_{t+1} \mid \myvec{x}_1, \myvec{x}_2, \cdots , \myvec{x}_t) \]
By repeating this process, we generate an entire sequence from a given initial sequence, which could be of length one, or even empty, in which case we would sample from a distribution \(p(\myvec{x}_1)\) to generate the first element of the sequence.
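To make this sampling loop concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: `conditional_distribution` is a placeholder standing in for whatever model supplies \(p(\myvec{x}_t \mid \myvec{x}_1, \cdots , \myvec{x}_{t-1})\), and here it simply returns a uniform distribution over a toy vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real model would use the words or characters of the task domain.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def conditional_distribution(prefix):
    """Placeholder for p(x_t | x_1, ..., x_{t-1}).
    A trained model would compute these probabilities from the prefix;
    here we simply return a uniform distribution over VOCAB."""
    return np.full(len(VOCAB), 1.0 / len(VOCAB))

def sample_sequence(prefix, length):
    """Generate a sequence by repeatedly sampling x_t from the
    conditional distribution and appending it to the prefix."""
    seq = list(prefix)
    for _ in range(length):
        probs = conditional_distribution(seq)
        seq.append(rng.choice(VOCAB, p=probs))
    return seq

print(sample_sequence([], 5))       # empty initial sequence: first draw plays the role of p(x_1)
print(sample_sequence(["the"], 5))  # initial sequence of length one
```

The same loop underlies all the generation tasks above; only the model behind `conditional_distribution` changes.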
Many simplifying assumptions are made to make such distributions tractable in practice. For example, in an \(n\)th order Markov process, the conditional probability is assumed to depend only on the most recent \(n\) time steps:
\[ p(\myvec{x}_t \mid \myvec{x}_1, \cdots , \myvec{x}_{t-1}) = p(\myvec{x}_t \mid \myvec{x}_{t-n}, \cdots , \myvec{x}_{t-1}) \]
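As an illustration of the Markov assumption, the sketch below fits a toy model with \(n = 2\) by counting how often each token follows each length-\(n\) context. The corpus and function names are invented for the example; a practical model would also need smoothing to handle unseen contexts.

```python
from collections import Counter, defaultdict
import random

random.seed(0)

def fit_markov(tokens, n):
    """Estimate an nth order Markov model by counting continuations:
    the next-token distribution depends only on the previous n tokens."""
    counts = defaultdict(Counter)
    for i in range(n, len(tokens)):
        context = tuple(tokens[i - n:i])
        counts[context][tokens[i]] += 1
    return counts

def sample_next(counts, context):
    """Sample x_t from the estimated p(x_t | x_{t-n}, ..., x_{t-1})."""
    options = counts[tuple(context)]
    tokens, weights = zip(*options.items())
    return random.choices(tokens, weights=weights)[0]

corpus = "the cat sat on the mat the cat lay on the mat".split()
model = fit_markov(corpus, n=2)
print(sample_next(model, ["the", "cat"]))  # 'sat' or 'lay', in proportion to their counts
```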
We will see three different ways to use the power of neural networks in modelling conditional distributions over sequences: recurrent neural networks, transformers and temporal convolutional networks. The latter will be used for classifying text rather than predicting the future, but the principles are the same: we seek a probability distribution conditioned on a sequence.