Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Have I really mastered the various machine learning models with latent variables? Neural networks, RBMs, the various probabilistic graphical models.

The EM algorithm, variational inference, mean field, and the other solution methods built around latent variables.

The convex optimization methods behind the algorithms above.

Work through the derivations by hand.

1 Introduction

In the past, to solve the image captioning task, one typically extracted image features from a fully-connected layer of a CNN (the so-called CNN code). But rather than compressing an entire image into a single static representation, attention allows salient features to come dynamically to the forefront as needed. The authors introduce two attention-based image caption generators under a common framework: (a) a soft deterministic attention mechanism trainable by standard back-propagation, and (b) a hard stochastic attention mechanism trainable by maximizing an approximate variational lower bound, or equivalently by REINFORCE.

2 Image Caption Generation with Attention Mechanism

2.1 Encoder: convolutional features 

The model takes a raw image as input and generates a caption y encoded as a sequence of 1-of-K vectors:

$$y = \{y_1, \ldots, y_C\}, \qquad y_i \in \mathbb{R}^K$$

where $K$ is the size of the vocabulary and $C$ is the length of the caption.

We extract $L$ annotation vectors from a CNN, each a $D$-dimensional vector corresponding to a part of the image:

$$a = \{a_1, \ldots, a_L\}, \qquad a_i \in \mathbb{R}^D$$

2.2 Decoder: LSTM

The LSTM generates one word at each time step, conditioned on its previous hidden state, the previously generated word, and a context vector (the image attention).

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t)$$

where $i_t, f_t, o_t, g_t$ are the input, forget, output and input-modulation gates, $E$ is the word-embedding matrix, and $T_{s,t}$ denotes a learned affine transformation from $\mathbb{R}^s$ to $\mathbb{R}^t$.
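As a rough illustration of this update, here is a minimal numpy sketch of one decoding step (not the authors' code); the dimensions `m`, `n`, `D` and the weight matrix `T` are placeholder values of my own choosing, and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, n, D = 512, 1000, 512                 # embedding, hidden, feature dims (assumed)
rng = np.random.default_rng(0)
T = rng.standard_normal((m + n + D, 4 * n)) * 0.01   # affine map on [E y_{t-1}; h_{t-1}; z_hat]

def lstm_step(Ey_prev, h_prev, c_prev, z_hat):
    """One decoder step: gates from [E y_{t-1}; h_{t-1}; z_hat_t], then c_t and h_t."""
    x = np.concatenate([Ey_prev, h_prev, z_hat]) @ T
    i, f, o, g = np.split(x, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g               # memory update
    h = o * np.tanh(c)                   # new hidden state
    return h, c

h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n), rng.standard_normal(D))
```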

$\hat{z}_t$ is the context vector. It captures the visual information associated with a particular input location and is a dynamic representation of the relevant part of the image as time passes. More specifically, $\hat{z}_t$ is computed from the $L$ annotation vectors; for each vector $a_i$ we assign a weight $\alpha_{ti}$ representing its relative importance for the current context vector:

$$e_{ti} = f_{\text{att}}(a_i, h_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L}\exp(e_{tk})}$$

$$\hat{z}_t = \phi(\{a_i\}, \{\alpha_{ti}\})$$

Here $f_{\text{att}}$ is a multilayer perceptron conditioned on the previous hidden state $h_{t-1}$, and $\phi$ returns a single context vector from the annotation vectors and their weights; its two instantiations are described in Sections 2.3 and 2.4.
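A minimal numpy sketch of computing the attention weights and a soft context vector; the two-layer scoring MLP standing in for $f_{\text{att}}$ and all sizes and weights are illustrative, not the paper's.

```python
import numpy as np

L, D, n = 196, 512, 1000          # assumed sizes: L regions, D-dim features, n-dim LSTM state
rng = np.random.default_rng(0)

a = rng.standard_normal((L, D))   # annotation vectors a_1..a_L
h_prev = rng.standard_normal(n)   # previous LSTM hidden state h_{t-1}

# A small MLP scoring each location against the hidden state (illustrative weights)
W_a = rng.standard_normal((D, 512)) * 0.01
W_h = rng.standard_normal((n, 512)) * 0.01
v   = rng.standard_normal(512) * 0.01

e = np.tanh(a @ W_a + h_prev @ W_h) @ v            # e_{t,i}, one score per location, shape (L,)
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax over locations -> alpha_{t,i}

z_hat = alpha @ a                                  # soft context vector, shape (D,)
```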

We use a deep output layer to compute the output word probability given the LSTM state, the context vector, and the previous word:

$$p(y_t \mid a, y_1^{t-1}) \propto \exp\!\big(L_o\,(E y_{t-1} + L_h h_t + L_z \hat{z}_t)\big)$$

where $L_o \in \mathbb{R}^{K\times m}$, $L_h \in \mathbb{R}^{m\times n}$, $L_z \in \mathbb{R}^{m\times D}$ and $E$ are learned parameters, $n$ is the dimension of the LSTM hidden state, and $m$ is the word-embedding dimension.
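A numpy sketch of this deep output layer under the assumed dimensions above; the random matrices stand in for the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, n, D = 10000, 512, 1000, 512        # vocab, embedding, hidden, feature dims (assumed)

# Stand-ins for the learned parameters L_o, L_h, L_z and the embedding matrix E
L_o = rng.standard_normal((K, m)) * 0.01
L_h = rng.standard_normal((m, n)) * 0.01
L_z = rng.standard_normal((m, D)) * 0.01
E   = rng.standard_normal((m, K)) * 0.01

def word_distribution(y_prev_onehot, h_t, z_hat_t):
    """p(y_t | a, y_1..t-1) = softmax(L_o (E y_{t-1} + L_h h_t + L_z z_hat_t))."""
    logits = L_o @ (E @ y_prev_onehot + L_h @ h_t + L_z @ z_hat_t)
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

y_prev = np.zeros(K); y_prev[3] = 1.0     # previous word as a 1-of-K vector
p_t = word_distribution(y_prev, rng.standard_normal(n), rng.standard_normal(D))
```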

 

2.3 Stochastic Hard Attention

$$p(s_{t,i}=1 \mid s_{j<t}, a) = \alpha_{t,i}, \qquad \hat{z}_t = \sum_{i} s_{t,i}\, a_i$$

$s_{t,i}$ is an indicator one-hot variable which is set to 1 if the $i$-th location (out of $L$) is the one used to extract visual features at time $t$.

$$L_s = \sum_{s} p(s\mid a)\,\log p(y\mid s,a) \;\le\; \log \sum_{s} p(s\mid a)\, p(y\mid s,a) = \log p(y\mid a)$$

$L_s$ is the objective function to be optimized: a variational lower bound on the marginal log-likelihood $\log p(y\mid a)$. Its gradient with respect to the model parameters $W$ is

$$\frac{\partial L_s}{\partial W} = \sum_{s} p(s\mid a)\left[\frac{\partial \log p(y\mid s,a)}{\partial W} + \log p(y\mid s,a)\,\frac{\partial \log p(s\mid a)}{\partial W}\right]$$

This gradient is approximated by Monte Carlo sampling: the location $\tilde{s}_t$ is drawn from a multinoulli distribution parameterized by $\alpha_t$.

$$\tilde{s}^n \sim \text{Multinoulli}_L(\{\alpha_i\}), \qquad \frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(y\mid \tilde{s}^n,a)}{\partial W} + \log p(y\mid \tilde{s}^n,a)\,\frac{\partial \log p(\tilde{s}^n\mid a)}{\partial W}\right]$$

To reduce the variance of this estimator, a moving-average baseline is maintained across minibatches,

$$b_k = 0.9\, b_{k-1} + 0.1\, \log p(y\mid \tilde{s}_k, a)$$

and an entropy term $H[\tilde{s}]$ is added, giving the final learning rule

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(y\mid \tilde{s}^n,a)}{\partial W} + \lambda_r\big(\log p(y\mid \tilde{s}^n,a) - b\big)\frac{\partial \log p(\tilde{s}^n\mid a)}{\partial W} + \lambda_e\,\frac{\partial H[\tilde{s}^n]}{\partial W}\right]$$

In making a hard choice at every point, the function $\phi(\{a_i\},\{\alpha_{ti}\})$ returns a sampled $a_i$ at each time step, based upon a multinoulli distribution parameterized by $\alpha$.
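A numpy sketch of this hard-attention step: sample one location from a multinoulli (categorical) distribution over the $\alpha$'s, and form the REINFORCE-style weighting of the score-function gradient. The log-likelihood value, baseline, and $\lambda_r$ below are placeholder numbers, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 196, 512
a = rng.standard_normal((L, D))          # annotation vectors
alpha = rng.dirichlet(np.ones(L))        # attention weights alpha_{t,i} (stand-in values)

# Hard attention: draw one location s_t ~ Multinoulli_L(alpha) and use that
# single annotation vector as the context.
s_t = rng.choice(L, p=alpha)
z_hat = a[s_t]

# REINFORCE-style weighting: the gradient of log p(s~ | a) is scaled by how well
# the sampled locations explained the caption, minus a moving-average baseline b.
log_p_y = -42.0                          # placeholder for log p(y | s~, a) from the decoder
b = -45.0                                # running baseline b_k = 0.9 b_{k-1} + 0.1 log p(y | s~_k, a)
lambda_r = 1.0
reward_weight = lambda_r * (log_p_y - b) # multiplies d log p(s~ | a)/dW in the update

# Gradient of log p(s_t | a) w.r.t. the attention scores e (pre-softmax):
# d/de_i log alpha_{s_t} = 1[i == s_t] - alpha_i  (standard softmax score-function identity)
grad_log_ps_wrt_e = -alpha.copy()
grad_log_ps_wrt_e[s_t] += 1.0
grad_e = reward_weight * grad_log_ps_wrt_e
```

The $(\log p(y\mid \tilde{s},a) - b)$ factor plays the role of the REINFORCE reward: sampling the location makes the objective non-differentiable, which is why this score-function estimator is needed.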

 

2.4 Deterministic Soft Attention

Instead of sampling a location, soft attention takes the expectation of the context vector directly:

$$\mathbb{E}_{p(s_t\mid a)}[\hat{z}_t] = \sum_{i=1}^{L}\alpha_{t,i}\,a_i, \qquad \phi(\{a_i\},\{\alpha_{ti}\}) = \sum_{i=1}^{L}\alpha_{ti}\,a_i$$

which makes the whole model smooth and differentiable, so it can be trained end-to-end with standard back-propagation. This corresponds to computing the normalized weighted geometric mean (NWGM) of the word distribution over the attention locations:

$$\text{NWGM}[p(y_t=k\mid a)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1\mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1\mid a)}} = \frac{\exp\big(\mathbb{E}_{p(s_t\mid a)}[n_{t,k}]\big)}{\sum_j \exp\big(\mathbb{E}_{p(s_t\mid a)}[n_{t,j}]\big)}$$

where $n_t = L_o(E y_{t-1} + L_h h_t + L_z \hat{z}_t)$, so the deterministic model approximately maximizes the marginal likelihood over attention locations. The context vector is additionally scaled by a gating scalar predicted from the previous hidden state:

$$\beta_t = \sigma\big(f_\beta(h_{t-1})\big), \qquad \phi(\{a_i\},\{\alpha_{ti}\}) = \beta_t \sum_{i=1}^{L}\alpha_{ti}\,a_i$$

Finally, a doubly stochastic regularization encourages $\sum_t \alpha_{ti} \approx 1$, so the model attends to every part of the image over the course of generation, and the model is trained by minimizing the penalized negative log-likelihood:

$$L_d = -\log p(y\mid x) + \lambda \sum_{i=1}^{L}\Big(1 - \sum_{t=1}^{C}\alpha_{ti}\Big)^2$$
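A numpy sketch of the soft-attention context vector with the gating scalar and the doubly stochastic penalty; the attention weights, $\beta_t$, and $\lambda$ below are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, C = 196, 512, 12                    # regions, feature dim, caption length (assumed)
a = rng.standard_normal((L, D))

# Suppose alphas[t] holds the attention weights produced at decoding step t.
alphas = rng.dirichlet(np.ones(L), size=C)         # shape (C, L), each row sums to 1

# Soft attention: the context is the expectation of z_hat under the attention
# distribution, scaled by a gating scalar beta_t = sigmoid(f_beta(h_{t-1})).
beta_t = 0.7                                        # stand-in for sigmoid(f_beta(h_{t-1}))
z_hat_t = beta_t * (alphas[0] @ a)                  # expected context at step t = 0, shape (D,)

# Doubly stochastic regularization: push sum_t alpha_{t,i} toward 1 for every location i.
lam = 1.0
penalty = lam * np.sum((1.0 - alphas.sum(axis=0)) ** 2)
# Training minimizes  L_d = -log p(y | x) + penalty
```

Because every quantity here is a smooth function of $\alpha$, gradients flow through the context vector and the penalty with ordinary back-propagation, which is the practical advantage over the hard variant.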

2.5 Training Procedure

Optimization method: RMSProp and Adam.

How do we get the $a_i$? Use VGGnet: the 14×14×512 feature map of the fourth convolutional layer, before max-pooling.

This means the decoder operates on the flattened 196×512 ($L \times D$) annotation matrix.
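A sketch of this feature extraction, assuming torchvision's VGG19; the network is left untrained here just to demonstrate the shapes, and the exact layer choice is my approximation of the paper's setup.

```python
import torch
from torchvision import models

# Take the convolutional feature map before the final max-pool (512 x 14 x 14 for a
# 224x224 input) and flatten it into L = 196 annotation vectors of dimension D = 512.
vgg = models.vgg19()
encoder = vgg.features[:-1]               # drop the last max-pool layer

image = torch.randn(1, 3, 224, 224)       # a dummy 224x224 RGB image
with torch.no_grad():
    fmap = encoder(image)                 # shape (1, 512, 14, 14)

annotations = fmap.flatten(2).transpose(1, 2)   # shape (1, 196, 512) = (batch, L, D)
print(annotations.shape)
```

Using a convolutional feature map rather than a fully-connected "CNN code" is what gives the decoder spatial locations to attend over.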