
Image-to-image translation

We have seen how a GAN can be used to generate images from a latent distribution. Here we explore the use of the GAN framework for image-to-image translation.

We will be looking in detail at CycleGAN [1], a convolutional network for image-to-image translation that has been demonstrated for a variety of tasks, including style transfer, photo enhancement, object transfiguration and semantic segmentation. This method uses an adversarial loss to train the network.

Explanation of the overall method

To ground this explanation, we will consider the style transfer task of translating from a photo to a stylised painting (e.g. in the style of Monet).

We start with the domains of photos and paintings \(X\) and \(Y\), represented by a set of photos \(\{\mymat{x}_i\}_{i=1}^N\subset X\) and a set of typical stylised paintings \(\{\mymat{y}_j\}_{j=1}^M \subset Y\) (we use the set notation here for consistency with [1]). The data probability distributions are denoted by \(\mymat{x} \sim p_{data}(x)\) and \(\mymat{y} \sim p_{data}(y)\).

The network architecture is based on two convolutional networks: \(G:X \rightarrow Y\) mapping from photos to paintings and \(F:Y \rightarrow X\) mapping the other way.
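To make the two-network structure concrete, here is a minimal PyTorch sketch. The tiny convolutional mappings below are only illustrative stand-ins for \(G\) and \(F\); the actual CycleGAN generators are much deeper (residual-block encoder–decoders), so do not read this as the published architecture.

```python
import torch
import torch.nn as nn

def make_generator():
    # Illustrative stand-in for a CycleGAN generator: a tiny conv net mapping
    # a 3-channel image to a 3-channel image in [-1, 1].
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, padding=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 3, kernel_size=7, padding=3),
        nn.Tanh(),
    )

G = make_generator()  # G: X -> Y, photos to paintings
F = make_generator()  # F: Y -> X, paintings to photos

x = torch.randn(1, 3, 256, 256)   # a dummy batch of one "photo"
fake_painting = G(x)              # G(x), an image in Y
reconstructed_photo = F(fake_painting)  # F(G(x)), back in X
```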

The figure below shows results from CycleGAN mapping from paintings to photos (using \(F\)).


Figure 1 Painting to photo mapping.

From Zhu et al. [1]

The combined network (\(G\) and \(F\)) is trained with the given sets of photos and paintings. In general, the photos will be independent of the paintings and need not depict the same scenes. If we had corresponding pairs of photos and paintings, we could use a supervised training regime. In the absence of corresponding pairs, CycleGAN uses an adversarial approach.

Consider first the mapping \(G:X \rightarrow Y\). There is a discriminator \(D_Y\) that distinguishes between real paintings from \(Y\) and fake paintings emitted by \(G\). The discriminator \(D_Y\) is updated to increase the objective:

\[ \mathcal{L}_{GAN}(G,D_Y,X,Y)=\mathbb{E}_{\mymat{y} \sim p_{data}(y)} \log(D_Y(\mymat{y})) + \mathbb{E}_{\mymat{x} \sim p_{data}(x)} \log(1-D_Y(G(\mymat{x})))\]

There is an analogous discriminator \(D_X\) that distinguishes between real photos from \(X\) and fake photos emitted by \(F\). The discriminator \(D_X\) is updated to increase the objective \(\mathcal{L}_{GAN}(F,D_X,Y,X)\).
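As a rough sketch, the objective above can be computed for \(D_Y\) as below, assuming \(D_Y\) is a convolutional classifier whose output passes through a sigmoid (so it emits probabilities). Note that the released CycleGAN implementation actually replaces this log objective with a least-squares variant for stability; the sketch follows the log form given above. An analogous function for \(D_X\) would simply swap the roles of the two generators and domains.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def d_y_objective(D_Y, G, real_paintings, photos):
    """L_GAN(G, D_Y, X, Y) from the discriminator's side: D_Y is trained to
    score real paintings as 1 and generated paintings G(x) as 0."""
    fake_paintings = G(photos).detach()   # don't backpropagate into G here
    real_scores = D_Y(real_paintings)     # D_Y(y), probabilities in (0, 1)
    fake_scores = D_Y(fake_paintings)     # D_Y(G(x))
    # Maximising log D_Y(y) + log(1 - D_Y(G(x))) is equivalent to
    # minimising the two binary cross-entropy terms below.
    return bce(real_scores, torch.ones_like(real_scores)) + \
           bce(fake_scores, torch.zeros_like(fake_scores))
```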

If we were to simply focus on training \(G\) adversarially using \(D_Y\), we might expect that \(G\) would start to produce paintings from input photos. However, it is unlikely the paintings produced would look anything like the input photos. These input photos can be seen as simply fulfilling the role of the random input vectors in DCGAN. The same argument holds for \(F\) when trained adversarially using the discriminator \(D_X\).

The solution to making the output from \(G\) resemble the input (and similarly for \(F\)) is to send photos sampled from \(p_{data}(x)\) on a round trip: generating paintings using \(G\) and then returning to photos using \(F\). We then add another term to the overall loss used in training the generators that penalises differences between the original photos and the corresponding photos after cycling through \(G\) and \(F\), and likewise for paintings cycled through \(F\) and then \(G\). This 'cycle consistency' loss is:

\[ \mathcal{L}_{cyc}(G,F)=\mathbb{E}_{\mymat{x} \sim p_{data}(x)}\big[\| F(G(\mymat{x}))-\mymat{x} \|_1\big] + \mathbb{E}_{\mymat{y} \sim p_{data}(y)}\big[\| G(F(\mymat{y}))-\mymat{y} \|_1\big]\]
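A minimal sketch of this loss in PyTorch, assuming `G` and `F` are the mapping networks from the earlier sketch and `photos` and `paintings` are batches of image tensors; the mean absolute difference is used as the per-image L1 term (up to a constant factor):

```python
import torch

def cycle_consistency_loss(G, F, photos, paintings):
    """L_cyc(G, F): L1 distance between each image and its reconstruction
    after a round trip through both generators."""
    forward_cycle = torch.mean(torch.abs(F(G(photos)) - photos))          # ||F(G(x)) - x||_1
    backward_cycle = torch.mean(torch.abs(G(F(paintings)) - paintings))   # ||G(F(y)) - y||_1
    return forward_cycle + backward_cycle
```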

The discriminators \(D_X\) and \(D_Y\) and the generators \(G\) and \(F\) are updated alternately, until convergence or for a fixed number of epochs. Both generators are updated together using the objective:

\[ \mathcal{L}(G,F,D_X, D_Y)=\mathcal{L}_{GAN}(G,D_Y,X,Y) + \mathcal{L}_{GAN}(F,D_X,Y,X) + \lambda\mathcal{L}_{cyc}(G,F) \]

where \(\lambda\) controls the relative importance of the adversarial terms (the generated photos and paintings looking realistic) versus the cycle-consistency term (the extent to which input and output images remain in close correspondence).
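Putting the pieces together, the sketch below shows one alternating update. It reuses the hypothetical `G`, `F`, `bce`, `d_y_objective` and `cycle_consistency_loss` from the earlier sketches and assumes a second discriminator `D_X` with an analogous `d_x_objective`; the value \(\lambda = 10\) and the Adam settings follow [1], but everything else is illustrative rather than the published implementation.

```python
import itertools
import torch

lambda_cyc = 10.0  # weight on the cycle-consistency term, as in [1]
opt_gen = torch.optim.Adam(itertools.chain(G.parameters(), F.parameters()), lr=2e-4)
opt_dis = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()), lr=2e-4)

def training_step(photos, paintings):
    # 1) Update both generators together on the combined objective
    #    L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + lambda * L_cyc(G, F).
    opt_gen.zero_grad()
    fake_paintings, fake_photos = G(photos), F(paintings)
    scores_y, scores_x = D_Y(fake_paintings), D_X(fake_photos)
    # Each generator tries to make its discriminator score the fakes as real (1).
    gan_g = bce(scores_y, torch.ones_like(scores_y))
    gan_f = bce(scores_x, torch.ones_like(scores_x))
    gen_loss = gan_g + gan_f + lambda_cyc * cycle_consistency_loss(G, F, photos, paintings)
    gen_loss.backward()
    opt_gen.step()

    # 2) Update the discriminators to better separate real from generated images.
    opt_dis.zero_grad()
    dis_loss = d_y_objective(D_Y, G, paintings, photos) + \
               d_x_objective(D_X, F, photos, paintings)
    dis_loss.backward()
    opt_dis.step()
```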

Reading

To gain an in-depth understanding of CycleGAN, please look at the dedicated website and read the paper, which appeared at the IEEE International Conference on Computer Vision in 2017 [1]. The paper and website contain many examples of CycleGAN applied to a range of tasks.

Reference

[1] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks", in IEEE International Conference on Computer Vision (ICCV), 2017.