Generative Adversarial Text to Image Synthesis

1 Introduction

This work proposes a method to translate text descriptions into image pixels. One thorny issue that deep learning alone does not solve is that the distribution of images conditioned on a text description is highly multimodal: there are very many plausible configurations of pixels that correctly illustrate the description. This work develops a simple and effective GAN architecture and training strategy that enables compelling text-to-image synthesis of bird and flower images.


2 Model

The approach is to train a deep convolutional generative adversarial network (DC-GAN) conditioned on text features. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text feature.


2.1 Network

In the generator G, we sample a noise vector z, compress the text feature to a smaller dimension with a fully-connected layer, concatenate the two, and feed the result into the network.
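The construction of the generator input can be sketched in a few lines of plain Python. This is only an illustration: the dimensions, the random weights, the helper names (`linear`, `generator_input`), and the choice of standard Gaussian noise are assumptions made for the example, not values taken from the text.

```python
import random

def linear(x, weights, bias):
    """Fully-connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def generator_input(text_feature, weights, bias, noise_dim, rng):
    """Compress the text feature, then append it to a noise vector z."""
    compressed = linear(text_feature, weights, bias)
    # Assumption: z is drawn from a standard Gaussian.
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + compressed  # list concatenation: [z ; compressed text]

rng = random.Random(0)
text_dim, compressed_dim, noise_dim = 8, 3, 5  # illustrative sizes only
text_feature = [rng.uniform(-1, 1) for _ in range(text_dim)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(text_dim)]
           for _ in range(compressed_dim)]
bias = [0.0] * compressed_dim

g_in = generator_input(text_feature, weights, bias, noise_dim, rng)
print(len(g_in))  # 8 = noise_dim + compressed_dim
```

The concatenated vector is what the first generator layer would consume; in a real model the weights of the compression layer are learned jointly with the rest of G.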

In the discriminator D, we apply several layers of stride-2 convolution with spatial batch normalization followed by leaky ReLU. When the spatial dimension of the discriminator's feature map reaches 4 x 4, we replicate the description embedding spatially and perform a depth concatenation.
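The spatial replication and depth concatenation can be illustrated with nested lists standing in for tensors. The layout (channels, height, width), the channel counts, and the function name `replicate_and_concat` are assumptions for this sketch.

```python
def replicate_and_concat(feature_map, embedding):
    """Tile each embedding value over the spatial grid and stack the
    resulting constant channels after the image feature channels.

    feature_map: nested list indexed as [channel][row][column]
    embedding:   flat list of floats
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    tiled = [[[value] * width for _ in range(height)] for value in embedding]
    return feature_map + tiled  # concatenate along the channel (depth) axis

channels, size = 2, 4  # illustrative: 2 feature channels on a 4 x 4 grid
feature_map = [[[0.0] * size for _ in range(size)] for _ in range(channels)]
embedding = [0.5, -1.0, 2.0]

combined = replicate_and_concat(feature_map, embedding)
print(len(combined), len(combined[0]), len(combined[0][0]))  # 5 4 4
```

Every spatial location now sees the same text embedding next to its local image features, so later layers can judge image and text jointly.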

2.2 Match-aware discriminator (GAN-CLS)

Once G has learned to generate plausible images, it must also learn to align them with the conditioning information, and likewise, D must learn to evaluate whether samples from G meet this conditioning constraint. The discriminator must therefore separate two sources of error: unrealistic images (for any text), and realistic images of the wrong class that mismatch the conditioning information.
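One way to make the discriminator match-aware is to score three kinds of pairs: a real image with its matching text, a real image with mismatched text, and a generated image with its conditioning text. The sketch below assumes D outputs a probability in (0, 1); the function name, the `eps` guard, and the equal weighting of the two rejection terms are assumptions of this example.

```python
import math

def gan_cls_discriminator_loss(s_real, s_wrong, s_fake, eps=1e-12):
    """Loss for D given its scores on three (image, text) pairs.

    s_real:  score for (real image, matching text), should go to 1
    s_wrong: score for (real image, mismatched text), should go to 0
    s_fake:  score for (generated image, matching text), should go to 0
    """
    objective = math.log(s_real + eps) + 0.5 * (
        math.log(1.0 - s_wrong + eps) + math.log(1.0 - s_fake + eps))
    return -objective  # D maximizes the objective, so minimize its negative

good = gan_cls_discriminator_loss(0.9, 0.1, 0.1)         # separates all three cases
ignores_text = gan_cls_discriminator_loss(0.9, 0.9, 0.1)  # fooled by mismatched text
print(good < ignores_text)
```

A discriminator that only judges realism is penalized on the mismatched pair, which is what pushes it to use the text.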


2.3 Learning with manifold interpolation (GAN-INT)

The amount of text data is a limiting factor for image generation performance. It has been observed that deep neural networks tend to learn high-level representations in which interpolations between training data embeddings also lie on or near the data manifold.

Motivated by this property, we can generate a large number of additional text embeddings by simply interpolating between embeddings of training set captions.

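This can be viewed as adding an additional term to the generator objective to minimize:

    E_{t1, t2 ~ p_data} [ log(1 - D(G(z, beta * t1 + (1 - beta) * t2))) ]

where z is drawn from the noise distribution and beta interpolates between the text embeddings t1 and t2.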
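The interpolation step itself is a convex combination of two caption embeddings. The following sketch assumes embeddings are plain lists of floats; the helper names and the cyclic pairing used to build a batch are choices made for this example only.

```python
def interpolate(t1, t2, beta=0.5):
    """Return beta * t1 + (1 - beta) * t2, element by element."""
    if len(t1) != len(t2):
        raise ValueError("embeddings must have the same dimension")
    return [beta * a + (1.0 - beta) * b for a, b in zip(t1, t2)]

def interpolated_batch(embeddings, beta=0.5):
    """Pair each embedding with the next one in the batch (cyclically)."""
    n = len(embeddings)
    return [interpolate(embeddings[i], embeddings[(i + 1) % n], beta)
            for i in range(n)]

batch = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
print(interpolated_batch(batch))  # [[1.0, 3.0], [3.0, 2.0], [2.0, 1.0]]
```

The interpolated embeddings have no corresponding real image, so they only enter training through the generator term: G is asked to produce images that D accepts for these synthetic conditions.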


2.4 Inverting the generator for style transfer

If the text encoding f(t) captures the image content, then in order to generate a realistic image the noise sample z should capture style factors such as background color and pose. With a trained GAN, one may wish to transfer the style of a query image onto the content of a particular text description. To achieve this, one can train a convolutional network to invert G, regressing from a sample x' <- G(z, f(t)) back onto z. We use a simple squared loss to train the style encoder:

    L_style = E_{t, z ~ N(0, 1)} || z - S(G(z, f(t))) ||_2^2

where S is the style encoder network. With a trained generator and style encoder, style transfer from a query image x onto text t proceeds as follows:

    s <- S(x),  x' <- G(s, f(t))

where x' is the resulting image and s is the predicted style.
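The squared loss and the two-step transfer procedure can be sketched with the generator and style encoder passed in as ordinary callables. The toy generator and encoder below are hypothetical stand-ins (the "image" is just the style vector followed by the text feature) so that the example runs without a trained model.

```python
def style_loss(z, z_pred):
    """Squared error between the true noise z and the encoder's estimate."""
    return sum((a - b) ** 2 for a, b in zip(z, z_pred))

def transfer_style(query_image, text_feature, generator, style_encoder):
    """Predict the style of the query image, then render the new text with it."""
    s = style_encoder(query_image)
    return generator(s, text_feature), s

# Hypothetical stand-ins for trained networks, for illustration only.
def toy_generator(z, text_feature):
    return list(z) + list(text_feature)

def toy_style_encoder(image, noise_dim=2):
    return image[:noise_dim]

z = [0.3, -0.7]
query = toy_generator(z, [1.0, 0.0])                 # image with style z
print(style_loss(z, toy_style_encoder(query)))       # 0.0 for a perfect inverse
result, s = transfer_style(query, [0.0, 1.0], toy_generator, toy_style_encoder)
print(result)  # [0.3, -0.7, 0.0, 1.0]
```

In practice S would be a convolutional network trained on samples from G, and the loss would be averaged over many sampled (z, t) pairs.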
