
Text classification

One of the most common forms of sequential data is found in natural language. We can treat a piece of text as a sequence made up of words, characters, partial words, or even a mixture of all three. For example, the text string 'The cats ran quickly' could be represented as any of the following sequences:

'the' 'cats' 'ran' 'quickly'

't' 'h' 'e' <space> 'c' 'a' 't' 's' <space> 'r' 'a' 'n' <space> 'q' 'u' 'i' 'c' 'k' 'l' 'y'

'the' <space> 'cat' 's' <space> 'ran' <space> 'quick' 'ly'

Notice that in the second and third sequences we need the <space> symbol, since the tokens (characters or partial words) no longer correspond directly to whole words separated by spaces.

We will refer to the elements of a sequence as tokens and the set of all tokens as a vocabulary.
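To make this concrete, here is a minimal Python sketch of the three tokenisations above (the sub-word split is hand-picked for illustration; real systems learn such splits from data):

```python
# Three illustrative tokenisations of the same string.
text = "the cats ran quickly"

word_tokens = text.split()        # ['the', 'cats', 'ran', 'quickly']
char_tokens = list(text)          # ['t', 'h', 'e', ' ', 'c', 'a', 't', 's', ...]
subword_tokens = ['the', ' ', 'cat', 's', ' ', 'ran', ' ', 'quick', 'ly']

vocabulary = sorted(set(word_tokens))   # the set of all word tokens
```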

One-hot encoding

To process token sequences in a neural network, we must convert each token into a vector. An easy way to do this is to use a one-hot encoding.

Suppose we have a vocabulary of \(N\) tokens. We assume that the vocabulary is ordered so that each token has an index from \(\{1, 2, \cdots, N\}\). In the one-hot encoding, each token is represented by an \(N\)-dimensional binary vector with a one in the position of the index for that token and zero everywhere else. Thus, the 4th token would be represented as follows:

\[\begin{bmatrix} 0& 0& 0& 1& 0& \cdots &0& 0 \end{bmatrix}\]

We can use a one-hot encoding to represent other kinds of tokens. For example, consider the 10 classes of CIFAR-10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). In the one-hot encoding, we would represent the class 'bird' by the binary vector:

\[\begin{bmatrix} 0&0&1&0&0&0&0&0&0&0 \end{bmatrix}\]
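As a minimal sketch (assuming PyTorch and using the CIFAR-10 class names above as the vocabulary), a one-hot vector can be built as follows:

```python
import torch
import torch.nn.functional as F

vocab = ['airplane', 'automobile', 'bird', 'cat', 'deer',
         'dog', 'frog', 'horse', 'ship', 'truck']
N = len(vocab)

idx = torch.tensor(vocab.index('bird'))        # index of 'bird' (0-based here)
one_hot = F.one_hot(idx, num_classes=N)
print(one_hot)                                 # tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
```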

Word embedding

A useful technique developed for use with word tokens is to map each word into an \(M\)-dimensional space in which similar words are close together in terms of Euclidean distance. The value of \(M\) is generally much less than the number of tokens in the vocabulary \(N\) (written \(M \ll N\)). The sequence of words in a text is replaced by a sequence of \(M\)-dimensional vectors. We refer to this process as word embedding.

The so-called distributional hypothesis, first postulated in the 1950s, states that words in similar contexts tend to have similar meanings. For example, the words 'oculist' and 'eye-doctor' tend to occur in the same contexts as words like 'eye' and 'examined', as in:

I called the oculist to arrange an eye examination

He became an eye-doctor after examining my eyes

You can read more about the background to this approach in chapter 6 of the draft 3rd edition of "Speech and Language Processing" by Dan Jurafsky and James H. Martin.

Taking an example from this book, suppose you don't know what ongchoi means, but see these six word sequences:

ongchoi is delicious sauteed with garlic

ongchoi is superb over rice

... ongchoi leaves with salty sauces ...

... spinach sauteed with garlic over rice ...

... chard stems and leaves are delicious ...

... collard greens and other salty leafy greens

The context words ('delicious', 'sauteed', 'garlic', 'rice', 'salty', 'leaves') suggest that ongchoi has a meaning related to other leafy greens such as spinach, chard and collard greens. We produce an embedding for each token by modelling the probability that other tokens appear in the same context.

The idea of a word embedding is illustrated in the following figure from Jurafsky and Martin, where a 60-dimensional word embedding is visualised in \(2D\). Note that related words are close together and unrelated words are far apart. Some of the tokens consist of two words (e.g. 'not good').


Figure 1. Visualisation of a set of words, partial words and short phrases in a 60D embedding.

Around 25 words, parts of words and two-word phrases are positioned within a rectangular area. Similar words and phrases are close together. For example, 'not good' and 'bad' are close together, and 'incredibly good' and 'fantastic' are close together.

The *Word2vec* approach is widely used. There are two principal methods; we will examine the so-called skip-gram method, in which a simple fully connected encoder-decoder network is trained to model the probability of other words appearing in the same context as a given word. The encoding of each word becomes its word embedding.


Figure 2. Skip-gram network.

The input to the network is a one-hot encoding of a single word. This is followed by a fully connected layer from \(N\) (the size of the one-hot encoding and the number of possible words) to a smaller number, in this case 300; the output of this layer is the required embedding. A second fully connected layer maps from 300 back to \(N\), followed by a softmax. The output is a probability distribution over the set of \(N\) possible words.
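A minimal PyTorch sketch of this encoder-decoder follows; the class and layer names are mine and the vocabulary size is illustrative. Note that a fully connected layer applied to a one-hot vector simply selects one row of its weight matrix, so an embedding lookup can stand in for the first layer, and the softmax is usually folded into the cross-entropy loss at training time:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """One-hot word in -> 300-d embedding -> distribution over the N words."""
    def __init__(self, vocab_size, embed_dim=300):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embed_dim)   # N -> 300
        self.decoder = nn.Linear(embed_dim, vocab_size)      # 300 -> N

    def forward(self, word_ids):
        z = self.encoder(word_ids)      # the word embedding we actually want
        return self.decoder(z)          # logits over context words

model = SkipGram(vocab_size=10000)
logits = model(torch.tensor([42]))      # shape (1, 10000)
```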

The network is trained using a large corpus of text. Pairs of words are chosen from the text such that the second word appears in the context of the first word somewhere in the text. By 'context' we mean a fixed-length window with the first word at its centre. The context window is typically five words (two words either side of the target word).

In the following example, pairs of words are selected methodically from a source text, using a context window of size five.


Figure 3. Generating training data for skip-gram.

The sequence of words 'The rain in Spain falls mainly on the plain' is repeated four times. In the first case, the first word 'The' is highlighted, with a less pronounced highlight around the following two words 'rain in'. In the second case, the highlighted word is 'rain', with one neighbour to the left and two to the right also highlighted. In the third case, the word 'in' is highlighted, and so on. The words are selected in turn and a window formed on either side of size two, or less if the word is too close to the start or end of the sequence.
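A short sketch of this pair-generation step, using a window of two words either side (the function name is mine):

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(target, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

text = "the rain in spain falls mainly on the plain".split()
print(skipgram_pairs(text)[:4])
# [('the', 'rain'), ('the', 'in'), ('rain', 'the'), ('rain', 'in')]
```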

Using word-embedding in text classification

We can use word-embedding to produce a text classifier. For example, the 20 Newsgroups dataset consists of 20,000 newsgroup documents (postings), each falling into one of the following 20 categories:

  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • misc.forsale
  • talk.politics.misc
  • talk.politics.guns
  • talk.politics.mideast
  • talk.religion.misc
  • alt.atheism
  • soc.religion.christian

The task is to identify the category from a given document. A simple kind of network for doing this looks like the following:


Figure 4. Network architecture for text classification.

The input is a sequence of IDs for the first \(s\) words of a document. These IDs are mapped to a \(50 \times s\) tensor of word vectors using a lookup table. This tensor forms the input to a classifier.

The word embedding is 50-dimensional and we assume a fixed-width input of the first \(s\) word tokens from the given document. The classifier can be a simple linear classifier, trained using the negative log-likelihood loss function. Thus, we are seeking a classifier network that maximises the joint probability of the classifications in the training data. For a given test document, the output classification (the class ID in the figure above) is the one with maximum probability.
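A minimal PyTorch sketch of such a classifier follows; the class name, vocabulary size, sequence length \(s = 100\) and other hyperparameters are illustrative assumptions rather than values from a specific system:

```python
import torch
import torch.nn as nn

class SimpleTextClassifier(nn.Module):
    """Lookup table (word ID -> 50-d vector) followed by a linear classifier."""
    def __init__(self, vocab_size, n_classes=20, embed_dim=50, seq_len=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # the lookup table
        self.classify = nn.Linear(embed_dim * seq_len, n_classes)

    def forward(self, word_ids):                  # word_ids: (batch, seq_len)
        x = self.embed(word_ids)                  # (batch, seq_len, 50)
        x = x.flatten(start_dim=1)                # concatenate the s word vectors
        return nn.functional.log_softmax(self.classify(x), dim=1)

model = SimpleTextClassifier(vocab_size=30000)
loss_fn = nn.NLLLoss()                            # negative log-likelihood loss

word_ids = torch.randint(0, 30000, (4, 100))      # a batch of four documents
loss = loss_fn(model(word_ids), torch.tensor([3, 0, 19, 7]))
```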

Text classification using a CNN

Convolutional neural networks can be used in a text classifier by applying \(1D\) convolutions on the \(1D\) input sequence of embedded tokens. Because the single dimension of the convolution often corresponds to time, this form of convolution is sometimes called temporal convolution to distinguish it from the familiar \(2D\) convolutions operating on images. The operation of \(1D\) convolution with a kernel (mask) of size 3 is illustrated below:

By analogy with 2D convolution, the example shows how a window of size 3 (corresponding to a kernel of size 3) is moved across a padded 1D input vector to produce an output feature map.

Figure 5. 1D convolution with padding.
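The following PyTorch fragment reproduces this operation on a single 1D channel; the kernel weights and input values are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Kernel of size 3 with padding 1, so the output has the same length as the input.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)
conv.weight.data = torch.tensor([[[1.0, 0.0, -1.0]]])    # shape (out, in, kernel)

x = torch.tensor([[[3.0, 1.0, 4.0, 1.0, 5.0]]])          # (batch, channels, length)
with torch.no_grad():
    print(conv(x))                                       # tensor([[[-1., -1., 0., -1., 1.]]])
```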

We now consider the system for text classification proposed by Conneau et al. [1], which uses temporal convolution.

Datasets

The network is evaluated on the following text datasets:

Dataset | Training examples | Test examples | Classes | Classification task
AG's news | 120k | 7.6k | 4 | English news categorisation
Sogou news | 450k | 60k | 5 | Chinese news categorisation
DBPedia | 560k | 70k | 14 | Ontology classification
Yelp review polarity | 560k | 38k | 2 | Sentiment analysis
Yelp review full | 650k | 50k | 5 | Sentiment analysis
Yahoo! answers | 1400k | 60k | 10 | Topic classification
Amazon review full | 3000k | 650k | 5 | Sentiment analysis
Amazon review polarity | 3600k | 400k | 2 | Sentiment analysis

To illustrate, the Amazon review full dataset contains reviews by Amazon customers, including comments and a star rating from 1 to 5, such as this:

Amazon customer review: 4 stars. Title: 'Not a lot to say'. Comment: 'What can one say about a pack of batteries. Delivery however good'.

Figure 6. Amazon customer review.

The dataset contains 3000k examples (600k of each score) used for training and 650k examples (130k of each score) used for testing. The examples are chosen to have comments between 100 and 2014 characters in length.

The following table contains examples of four of the datasets:

from Conneau et al., EACL 2017 [1]

Dataset | Label | Sample
Yelp P. | +1 | Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. Hes been great over the years and is all about the big picture. [...]
Amz F. | 3 (out of 5) | I love this show, however, there are 14 episodes in the first season and this DVD only shows the first eight. [...] I hope the BBC will release another DVD that contains all the episodes, but for now this one is still somewhat enjoyable.
Sogou | "Sports" | ju4 xi1n hua2 se4 5 yue4 3 ri4 , be3i ji1ng 2008 a4o yu4n hui4 huo3 ju4 jie1 li4 ji1ng guo4 shi4 jie4 wu3 da4 zho1u 21 ge4 che2ng shi4
Yah. A. | "Computer, Internet" | "What should I look for when buying a laptop? What is the best brand and what's reliable?", "Weight and dimensions are important if you're planning to travel with the laptop. Get something with at least 512 mb of RAM. [...] is a good brand, and has an easy to use site where you can build a custom laptop."

The Sogou dataset uses pinyin, a romanisation system for Standard Chinese. The Yelp review polarity dataset contains labels that are either \(-1\) (negative review) or \(+1\) (positive review).

Input representation

The natural language in each sample is represented as a fixed-length character string of length \(s\), with 69 possible characters:

abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'"/| #$%^&*~`+=<>()[]{}

\(\square\) space

\(\diamond\) padding (to pad input text to s characters)

\(\oslash\) unknown character

For example, the text:

'what can one say about a pack of batteries delivery however good'

would be represented as a vector of IDs for character tokens as follows:


Figure 7. Padded character string.

A string of characters (letters, digits and punctuation) with special tokens to represent space, padding, and an unknown character. The string represents the text from the Amazon review above, with padding tokens at the end to extend to a fixed length s.

We represent the input as a \(16 \times s\) tensor, where each column is a \(16D\) character embedding. We can use a \(16 \times 69\) lookup table to find the embedding for a given character ID.

The character string from Figure 7 is mapped into a 16 x s tensor using a 16D embedding for each character.

Figure 8. Encoding the character string as a 16xs tensor using a 16D character embedding.
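The following sketch shows how such an encoding might be implemented in PyTorch; the alphabet string, ID scheme and fixed length \(s\) are simplifications of the representation described above, not the exact scheme used in the paper:

```python
import torch
import torch.nn as nn

alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/|#$%^&*~`+=<>()[]{} "
pad_id, unk_id = len(alphabet), len(alphabet) + 1
char_to_id = {c: i for i, c in enumerate(alphabet)}

def encode(text, s=1024):
    """Map text to a fixed-length vector of character IDs, padded to length s."""
    ids = [char_to_id.get(c, unk_id) for c in text.lower()][:s]
    return torch.tensor(ids + [pad_id] * (s - len(ids)))

embed = nn.Embedding(len(alphabet) + 2, 16)       # lookup table: character ID -> 16-d vector
x = embed(encode("what can one say about a pack of batteries"))
print(x.shape)                                    # torch.Size([1024, 16])
x = x.t()                                         # transpose to (16, s) for 1D convolution
```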

Network architecture

The network architecture is a deep convolutional network with the following form:


Figure 9. Network architecture for convolutional text classifier.

The network starts with a lookup table that embeds the input character sequence. This is followed by a 1D convolution with a kernel of size 3, producing 64 channels. Then come two convolutional blocks (explained in Figure 10) with 64 channels followed by max-pooling of size 2, two more blocks with 128 channels and max-pooling of size 2, two more with 256 channels and max-pooling of size 2, and two more with 512 channels and max-pooling of size 2. Note that the 1D sequence in each channel halves in length after each max-pooling operation. Finally, there are three fully connected layers and a softmax, leading to an output of size equal to the number of classes.

Each convolutional block has the following structure:


Figure 10. Convolutional block

The convolutional block, used several times in the main network, contains two convolutional layers with a kernel of size 3, each followed by batch normalisation and ReLU. The number of channels (feature maps) varies as indicated in Figure 9.

Notice that the number of feature maps doubles as their size halves. This is common with CNNs applied to images.
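A hedged PyTorch sketch of the convolutional block in Figure 10, and of how such blocks might be stacked with max-pooling as in Figure 9, is shown below; the helper function and the partial stack are illustrations, not the authors' code:

```python
import torch.nn as nn

def conv_block(in_channels, out_channels):
    """Two (Conv1d -> BatchNorm -> ReLU) stages with kernel size 3."""
    return nn.Sequential(
        nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_channels),
        nn.ReLU(),
        nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_channels),
        nn.ReLU(),
    )

# First two stages of the stack in Figure 9: channels double (64 -> 128) while
# each max-pooling of size 2 halves the sequence length.
stages = nn.Sequential(
    conv_block(64, 64), conv_block(64, 64), nn.MaxPool1d(2),
    conv_block(64, 128), conv_block(128, 128), nn.MaxPool1d(2),
)
```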

Although one could use a pre-generated character embedding on input, it is also possible to learn the embedding by simply incorporating the entries in the lookup table into the optimisation. This will tailor the embedding to the specific classification task and probably give better performance.
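In PyTorch terms, the entries of an nn.Embedding lookup table are ordinary parameters that are learned by default; alternatively, a pre-generated embedding can be loaded and frozen. The sizes below are illustrative:

```python
import torch
import torch.nn as nn

embed = nn.Embedding(72, 16)        # character lookup table (72 IDs, 16-d vectors)

# Learn the embedding jointly with the classifier: its weights receive
# gradients like any other parameter.
optimiser = torch.optim.Adam(embed.parameters(), lr=1e-3)

# Or start from a pre-generated embedding and keep it fixed.
pretrained = torch.randn(72, 16)    # stand-in for externally trained vectors
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)
```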

k-max pooling

The so-called k-max pooling layer (introduced for sentence modelling by Kalchbrenner et al. in 2014) works as follows:

Given some fixed \(k\) and a numerical sequence \(\myvec{p}\) of length \(n \geq k\), select the subsequence \(\myvec{q}\) of the \(k\) highest values of \(\myvec{p}\). The order of the values in \(\myvec{q}\) corresponds to their original order in \(\myvec{p}\). For example, let

\[\myvec{p} = (3,71,9,2,1,19,27,13,6,49)\]

k-max pooling on \(\myvec{p}\) for \(k=4\) gives:

\[\myvec{q}=(71,19,27,49)\]
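A minimal PyTorch sketch of this operation (the function name is mine):

```python
import torch

def kmax_pooling(p, k):
    """Keep the k largest values of a 1D tensor, in their original order."""
    _, idx = torch.topk(p, k)            # indices of the k largest values
    return p[idx.sort().values]          # sort indices to restore original order

p = torch.tensor([3, 71, 9, 2, 1, 19, 27, 13, 6, 49])
print(kmax_pooling(p, 4))                # tensor([71, 19, 27, 49])
```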

Reference

[1] Conneau, Alexis, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. “Very Deep Convolutional Networks for Text Classification.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, 1107–1116.