Sequential Data
Many kinds of data are intrinsically sequential in nature. For example:
- Text is composed of a sequence of words.
- Videos are made up of sequences of images depicting successive snapshots of a scene.
- A patient's medical record is made up of time-stamped medical events such as symptoms (e.g. headache, fever), diagnostic procedures (e.g. having a blood test), diagnoses (e.g. Streptococcal infection) and prescriptions (e.g. an antibiotic).
There are many applications that involve sequential data of one form or another. We can characterise these applications according to whether the input, the output, or both are sequences.
In image captioning, the task is to take a single image as input and produce a textual description as output, as in the example below. The textual output can be regarded as a sequence of words or a sequence of characters.
In sentiment analysis, the task is to take a textual description as input and generate a sentiment value as output (e.g. positive, negative, five star, one star).
In a similar task, the target is a vector of five numerical personality scores, as in the following example from the IBM Personality Insights service (now discontinued).
A major task is machine translation from one language to another, for example from English to French, as in the following example.
Because the input and output are both sequences, this is sometimes referred to as a seq2seq task.
Another seq2seq task is text to speech, where the input is text and the output is a speech waveform (a sequence of audio intensity values).
A final example is a text generator with no external input, which produces sequences of characters conforming to some language domain. Such a domain could, for example, be the writings of Shakespeare; the language generator is required to produce text in the style of Shakespeare without reproducing verbatim extracts from his written work.
Stochastic processes
In abstract terms, we can think of many of these tasks as predicting the future given the past: predicting the next word in a sentence, the next rainfall map, or the occurrence of heart disease. We can represent this as a conditional probability distribution:
\[ p(\myvec{x}_t \mid \myvec{x}_1, \myvec{x}_2, \cdots , \myvec{x}_{t-1}) \]
Such distributions define a stochastic process. Given values for \(\myvec{x}_1, \myvec{x}_2, \cdots , \myvec{x}_{t-1}\), we can sample from the conditional distribution to generate different possible futures. Having sampled a specific value \(\myvec{x}_t\) for time \(t\), we can repeat the process going forward and sample \(\myvec{x}_{t+1}\) from the conditional distribution:
\[ p(\myvec{x}_{t+1} \mid \myvec{x}_1, \myvec{x}_2, \cdots , \myvec{x}_t) \]
By repeating this process, we generate an entire sequence from a given initial sequence, which could be of length one, or even empty, in which case we would sample from a distribution \(p(\myvec{x}_1)\) to generate the first element of the sequence.
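To make this sampling loop concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: `conditional_distribution` is a placeholder standing in for whatever model supplies \(p(\myvec{x}_t \mid \myvec{x}_1, \cdots , \myvec{x}_{t-1})\), and here it simply returns a uniform distribution over a toy vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real model would use the words or characters of the task domain.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def conditional_distribution(prefix):
    """Placeholder for p(x_t | x_1, ..., x_{t-1}).
    A trained model would compute these probabilities from the prefix;
    here we simply return a uniform distribution over VOCAB."""
    return np.full(len(VOCAB), 1.0 / len(VOCAB))

def sample_sequence(prefix, length):
    """Generate a sequence by repeatedly sampling x_t from the
    conditional distribution and appending it to the prefix."""
    seq = list(prefix)
    for _ in range(length):
        probs = conditional_distribution(seq)
        seq.append(rng.choice(VOCAB, p=probs))
    return seq

print(sample_sequence([], 5))       # empty initial sequence: first draw plays the role of p(x_1)
print(sample_sequence(["the"], 5))  # initial sequence of length one
```

The same loop underlies all the generation tasks above; only the model behind `conditional_distribution` changes.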
Many simplifying assumptions are made to make such distributions tractable in practice. For example, in an \(n\)th order Markov process, the conditional probability is assumed to depend only on the most recent \(n\) time steps:
\[ p(\myvec{x}_t \mid \myvec{x}_1, \cdots , \myvec{x}_{t-1}) = p(\myvec{x}_t \mid \myvec{x}_{t-n}, \cdots , \myvec{x}_{t-1}) \]
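As an illustration of the Markov assumption, the sketch below fits a toy model with \(n = 2\) by counting how often each token follows each length-\(n\) context. The corpus and function names are invented for the example; a practical model would also need smoothing to handle unseen contexts.

```python
from collections import Counter, defaultdict
import random

random.seed(0)

def fit_markov(tokens, n):
    """Estimate an nth order Markov model by counting continuations:
    the next-token distribution depends only on the previous n tokens."""
    counts = defaultdict(Counter)
    for i in range(n, len(tokens)):
        context = tuple(tokens[i - n:i])
        counts[context][tokens[i]] += 1
    return counts

def sample_next(counts, context):
    """Sample x_t from the estimated p(x_t | x_{t-n}, ..., x_{t-1})."""
    options = counts[tuple(context)]
    tokens, weights = zip(*options.items())
    return random.choices(tokens, weights=weights)[0]

corpus = "the cat sat on the mat the cat lay on the mat".split()
model = fit_markov(corpus, n=2)
print(sample_next(model, ["the", "cat"]))  # 'sat' or 'lay', in proportion to their counts
```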
We will see three different ways to use the power of neural networks in modelling conditional distributions over sequences: recurrent neural networks, transformers and temporal convolutional networks. The latter will be used for classifying text rather than predicting the future, but the principles are the same: we seek a probability distribution conditioned on a sequence.