Generating Images from Captions with Attention

1 Introduction

This work proposes a sequential deep learning model that generates images from captions: the model draws a patch on a canvas at each time step while attending to the relevant words of the caption.

 

2 Related work

Deep Discriminative Models vs. Deep Generative Models

Earlier generative models focused on the Boltzmann Machine and the Deep Belief Network. Their drawback is that each requires a computationally costly MCMC step to approximate the derivatives of an intractable partition function, which makes them difficult to scale to large datasets.

The Variational Auto-Encoder (VAE) can be seen as a neural network with continuous latent variables. The encoder approximates the posterior distribution over the latents, and the decoder stochastically reconstructs the data from them.
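As a concrete illustration (not the model of this paper), here is a minimal VAE sketch in PyTorch; all layer sizes and names are hypothetical:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE sketch: encoder approximates the posterior, decoder reconstructs."""
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar   # logits of the reconstruction
```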

The Generative Adversarial Network (GAN) uses a noise-contrastive objective to avoid computing an intractable partition function. The model consists of a generator, which produces samples from noise drawn from a uniform distribution, and a discriminator, which distinguishes real images from generated ones.

 

3 Model

Captions are represented as sequences of consecutive words, and images are represented as sequences of patches drawn on a canvas c_t over time t = 1, 2, ..., T. The model can therefore be viewed as an instance of the sequence-to-sequence framework.

In other words, by treating both the caption and the image as time series, the sequence-to-sequence framework can be applied.


3.1 Language Model

The caption is represented as a sequence of 1-of-K encoded words:

$$y = (y_1, y_2, \ldots, y_N), \qquad y_i \in \{0, 1\}^K$$

A bidirectional LSTM is run over the caption, producing a sequence of hidden states:

$$h^{lang} = [h_1^{lang}, \ldots, h_N^{lang}], \qquad h_i^{lang} = [\overrightarrow{h_i};\, \overleftarrow{h_i}]$$
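A minimal sketch of such a caption encoder in PyTorch; the vocabulary and hidden sizes are hypothetical, and the embedding lookup is equivalent to a linear layer applied to the 1-of-K vectors:

```python
import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    """Bidirectional LSTM over 1-of-K encoded caption words (hypothetical sizes)."""
    def __init__(self, vocab_size=25000, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.bilstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, word_ids):     # word_ids: (batch, N) integer indices
        e = self.embed(word_ids)     # (batch, N, hidden)
        h_lang, _ = self.bilstm(e)   # (batch, N, 2*hidden): [h_fwd; h_bwd] per word
        return h_lang
```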

3.2 Image Model

The conditional DRAW network is a stochastic recurrent neural network built around a sequence of latent variables Z_t, whose output is accumulated on a canvas over all T time steps.

The mean and variance of the prior distribution over Z_t are:

$$\mu_t = \tanh\big(W_\mu\, h_{t-1}^{gen}\big)$$

$$\sigma_t = \exp\big(\tanh\big(W_\sigma\, h_{t-1}^{gen}\big)\big)$$
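Sampling z_t from this prior might look like the following sketch, assuming W_mu and W_sigma are learned linear maps (sizes hypothetical):

```python
import torch
import torch.nn as nn

hidden_gen, z_dim = 256, 100          # hypothetical sizes
W_mu = nn.Linear(hidden_gen, z_dim)
W_sigma = nn.Linear(hidden_gen, z_dim)

def prior_sample(h_gen_prev):
    """Sample z_t ~ N(mu_t, sigma_t) given the previous generator state."""
    mu = torch.tanh(W_mu(h_gen_prev))
    sigma = torch.exp(torch.tanh(W_sigma(h_gen_prev)))
    z = mu + sigma * torch.randn_like(mu)
    return z, mu, sigma
```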

The align function computes an alignment between the input caption and the intermediate steps of image generation.

The formula is as follows:

$$s_t = align(h_{t-1}^{gen}, h^{lang}) = \sum_{k=1}^{N} \alpha_{tk}\, h_k^{lang}, \qquad \alpha_{tk} \propto \exp\big(v^\top \tanh(U h_k^{lang} + W h_{t-1}^{gen} + b)\big)$$

$$h_t^{gen} = LSTM^{gen}\big(h_{t-1}^{gen}, [z_t, s_t]\big)$$
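This is additive (Bahdanau-style) attention; a sketch of the align function with hypothetical dimensions:

```python
import torch
import torch.nn as nn

lang_dim, gen_dim, att_dim = 256, 256, 128   # hypothetical sizes
U = nn.Linear(lang_dim, att_dim)
W = nn.Linear(gen_dim, att_dim)
v = nn.Linear(att_dim, 1, bias=False)

def align(h_gen_prev, h_lang):
    """Score each word state, softmax over words, then take the weighted sum s_t."""
    # h_lang: (batch, N, lang_dim); h_gen_prev: (batch, gen_dim)
    scores = v(torch.tanh(U(h_lang) + W(h_gen_prev).unsqueeze(1)))  # (batch, N, 1)
    alpha = torch.softmax(scores, dim=1)                            # attention weights
    return (alpha * h_lang).sum(dim=1)                              # (batch, lang_dim)
```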

The output of LSTM^gen is passed to the write operator, and the result is added to a cumulative canvas matrix:

$$c_t = c_{t-1} + write(h_t^{gen})$$

Finally, each entry c_{T,i} of the final canvas matrix c_T is transformed with a sigmoid function to produce a conditional Bernoulli distribution over the corresponding pixel.
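A sketch of the canvas accumulation, with the write operator simplified to a plain linear projection rather than DRAW's attention-based write (sizes hypothetical):

```python
import torch
import torch.nn as nn

gen_dim, img_dim = 256, 32 * 32          # hypothetical sizes
write = nn.Linear(gen_dim, img_dim)      # simplified write: no attention window

def draw_step(canvas, h_gen):
    """One generation step: add the written patch to the cumulative canvas."""
    return canvas + write(h_gen)

# After the final step T, sigmoid(c_T) gives the per-pixel Bernoulli means:
# p(x_i = 1 | y, Z_{1:T}) = torch.sigmoid(canvas_T)[..., i]
```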

 

3.3 Learning

The model is trained to maximize a variational lower bound L on the marginal likelihood of the correct image x given the input caption y:

$$\mathcal{L} = \sum_{Z} Q(Z \mid x, y)\, \log P(x \mid y, Z) - D_{KL}\big(Q(Z \mid x, y)\,\|\,P(Z \mid y)\big)$$

The KL term decomposes over the T generation steps:

$$D_{KL}\big(Q(Z \mid x, y)\,\|\,P(Z \mid y)\big) = \sum_{t=2}^{T} \mathbb{E}_{Q}\Big[D_{KL}\big(Q(Z_t \mid Z_{1:t-1}, x, y)\,\|\,P(Z_t \mid Z_{1:t-1}, y)\big)\Big] + D_{KL}\big(Q(Z_1 \mid x, y)\,\|\,P(Z_1)\big)$$

where the approximate posterior $Q(Z_t \mid Z_{1:t-1}, x, y)$ is a diagonal Gaussian whose mean and variance are computed by a separate inference LSTM that reads the target image.
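Assuming diagonal Gaussians for both prior and posterior at each step, the negative bound could be computed as in this sketch; all tensor names are hypothetical and the closed-form Gaussian KL stands in for the paper's exact derivation:

```python
import torch
import torch.nn.functional as F

def vlb_loss(canvas_T, x, mus_q, sigmas_q, mus_p, sigmas_p):
    """Negative variational lower bound: Bernoulli reconstruction + per-step KLs."""
    # Reconstruction term: -log p(x | y, Z_{1:T}) with Bernoulli means sigmoid(c_T).
    rec = F.binary_cross_entropy_with_logits(canvas_T, x, reduction='sum')
    # KL(Q(Z_t | ...) || P(Z_t | ...)) between diagonal Gaussians, summed over t.
    kl = 0.0
    for mq, sq, mp, sp in zip(mus_q, sigmas_q, mus_p, sigmas_p):
        kl = kl + (torch.log(sp / sq)
                   + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5).sum()
    return rec + kl
```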

 

4 Evaluation


The main goal of this work is to learn a model that understands the semantic content of textual descriptions of images, such as the properties of objects and the relationships between them, and then uses that knowledge to generate relevant images. The authors therefore change some words in a caption and check whether the model makes the corresponding changes in the generated samples.