Generative Adversarial Text to Image Synthesis

1 Introduction

This work proposes a method to translate text descriptions directly into image pixels. One difficult issue not solved by deep learning alone is that the distribution of images conditioned on a text description is highly multimodal: there are very many plausible configurations of pixels that correctly illustrate the description. The paper develops a simple and effective GAN architecture and training strategy that enables compelling text-to-image synthesis of bird and flower images.


2 Model

The approach is to train a deep convolutional generative adversarial network (DC-GAN) conditioned on text features. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text features.

[Figure: the text-conditional DC-GAN architecture; the text encoding feeds both the generator and the discriminator]

2.1 Network

In the generator G, we sample a noise vector z, compress the text embedding to a smaller dimension with a fully-connected layer, concatenate the two, and feed the result into a stack of upsampling layers (see the sketch below).
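As a concrete illustration, here is a minimal PyTorch sketch of such a generator. The dimensions (100-d noise, a 1024-d text embedding compressed to 128, 64×64 RGB output) follow the paper's setup, but the exact layer stack is my own assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the text-conditional generator: compress the text embedding,
    concatenate it with the noise vector, then upsample to a 64x64 image."""
    def __init__(self, z_dim=100, txt_dim=1024, txt_proj_dim=128, ngf=64):
        super().__init__()
        # Fully-connected layer compressing the text embedding to a smaller dimension
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, txt_proj_dim),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Transposed convolutions: (z_dim + txt_proj_dim) x 1 x 1 -> 3 x 64 x 64
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + txt_proj_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z, txt_emb):
        t = self.txt_proj(txt_emb)                      # (B, txt_proj_dim)
        h = torch.cat([z, t], dim=1)                    # (B, z_dim + txt_proj_dim)
        return self.net(h.unsqueeze(-1).unsqueeze(-1))  # (B, 3, 64, 64)
```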

In the discriminator D, we perform several stride-2 convolutions with spatial batch normalization followed by leaky ReLU. When the spatial resolution reaches 4×4, we replicate the compressed description embedding spatially and perform a depth concatenation before computing the final score (see the sketch below).
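A matching PyTorch sketch of the discriminator under the same assumptions: four stride-2 convolutions bring a 64×64 image down to 4×4, the compressed text embedding is replicated spatially and depth-concatenated, and two more convolutions produce the score.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the text-conditional discriminator: stride-2 convolutions down to
    a 4x4 feature map, depth concatenation with the replicated text embedding,
    then a final score in [0, 1]."""
    def __init__(self, txt_dim=1024, txt_proj_dim=128, ndf=64):
        super().__init__()
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, txt_proj_dim),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # 3 x 64 x 64 -> (ndf*8) x 4 x 4 via stride-2 convolutions
        self.conv = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
        )
        # Joint image/text layers applied after the depth concatenation at 4x4
        self.joint = nn.Sequential(
            nn.Conv2d(ndf * 8 + txt_proj_dim, ndf * 8, 1, 1, 0, bias=False),
            nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0),
            nn.Sigmoid(),
        )

    def forward(self, img, txt_emb):
        h = self.conv(img)                                    # (B, ndf*8, 4, 4)
        t = self.txt_proj(txt_emb)                            # (B, txt_proj_dim)
        t = t.view(t.size(0), -1, 1, 1).expand(-1, -1, 4, 4)  # replicate spatially
        h = torch.cat([h, t], dim=1)                          # depth concatenation
        return self.joint(h).view(-1)                         # (B,) real/match score
```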

2.2 Matching-aware discriminator (GAN-CLS)

Once G has learned to generate plausible images, it must also learn to align them with the conditioning information, and likewise D must learn to evaluate whether samples from G meet this conditioning constraint. The discriminator therefore observes two kinds of error: unrealistic images (for any text), and realistic images of the wrong class that mismatch the conditioning text. In addition to the usual real/fake inputs, the GAN-CLS training algorithm adds a third input consisting of real images paired with mismatched text, which the discriminator must learn to score as fake.

[Algorithm: GAN-CLS training, where D scores {real image, matching text}, {real image, mismatching text}, and {fake image, matching text}]
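As a sketch of the discriminator update just described (reusing the Generator/Discriminator sketches from 2.1; the 1/2 weighting of the two "fake" terms follows the paper's Algorithm 1):

```python
import torch
import torch.nn.functional as F

def gan_cls_d_loss(D, G, real_img, txt_emb, wrong_txt_emb, z):
    """GAN-CLS discriminator loss (sketch): penalize D for accepting fake images
    with matching text AND for accepting real images with mismatching text."""
    fake_img = G(z, txt_emb).detach()          # do not backprop into G here
    s_real = D(real_img, txt_emb)              # real image, matching text    -> 1
    s_wrong = D(real_img, wrong_txt_emb)       # real image, mismatching text -> 0
    s_fake = D(fake_img, txt_emb)              # fake image, matching text    -> 0
    ones = torch.ones_like(s_real)
    zeros = torch.zeros_like(s_real)
    return (F.binary_cross_entropy(s_real, ones)
            + 0.5 * (F.binary_cross_entropy(s_wrong, zeros)
                     + F.binary_cross_entropy(s_fake, zeros)))
```

In practice the mismatched text embedding can simply be taken from a different caption in the same minibatch.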

2.3 Learning with manifold interpolation (GAN-INT)

The amount of text data is a limiting factor for image generation performance. It has been observed that deep networks tend to learn high-level representations in which interpolations between training data embeddings also lie on or near the data manifold.

Motivated by this property, we can generate a large amount of additional text embeddings simply by interpolating between the embeddings of training-set captions.

This can be viewed as adding an additional term to the generator objective to minimize, shown in Equation (5) below.

[Equation (5): generator objective term on interpolated text embeddings]
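Reconstructed from the paper's formulation (with t1 and t2 the embeddings of two training captions and β the interpolation coefficient), the added term is roughly:

```latex
\mathbb{E}_{t_1, t_2 \sim p_{\mathrm{data}}}\left[\log\left(1 - D\left(G\left(z,\; \beta t_1 + (1-\beta)\, t_2\right)\right)\right)\right]
```

The paper fixes β = 0.5, i.e. it simply averages two caption embeddings; since there are no real images for these synthetic texts, the term pushes G to produce images that look realistic and matched even between training embeddings.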

2.4 Inverting the generator for style transfer

If the text encoding f(t) captures the image content, then in order to generate a realistic image the noise sample z should capture style factors such as background color and pose. With a trained GAN, one may wish to transfer the style of a query image onto the content of a particular text description. To achieve this, one can train a convolutional network to invert G, regressing from samples x' ← G(z, f(t)) back onto z. We use a simple squared loss to train the style encoder:

[Equation: squared loss for the style encoder S]
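A reconstruction of this loss from the paper's formulation (keeping f(t) for the text encoding, as above):

```latex
\mathcal{L}_{\mathrm{style}} \;=\; \mathbb{E}_{t,\, z \sim \mathcal{N}(0,1)} \left\lVert z - S\!\left(G\left(z, f(t)\right)\right) \right\rVert_2^2
```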

where S is the style encoder network. With a trained generator and style encoder, style transfer from a query image x onto a text description t proceeds as follows, where x' is the resulting image and s is the predicted style:

[Equation: two-step style transfer procedure]
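Reconstructed from the paper, the two steps are simply: encode the style of the query image, then generate with that style and the new text content.

```latex
s \leftarrow S(x), \qquad x' \leftarrow G\left(s, f(t)\right)
```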