Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Have I really mastered the various machine learning models with latent variables? Neural networks, RBMs, the various probabilistic graphical models.

The EM algorithm, variational inference, mean field, and the other solution methods built around latent variables.

The convex optimization methods behind the algorithms above.

Work through the derivations by hand.

1 Introduction

In the past, to solve the image captioning task, one typically extracted image features from a fully-connected layer of a CNN (the so-called CNN code). But rather than compressing an entire image into a single static representation, attention allows salient features to come dynamically to the forefront as needed. The authors introduce two attention-based image caption generators under a common framework: (a) a soft deterministic attention mechanism trainable by standard back-propagation, and (b) a hard stochastic attention mechanism trainable by maximizing an approximate variational lower bound, or equivalently by REINFORCE.

2 Image Caption Generation with Attention Mechanism

2.1 Encoder: convolutional features 

The model takes a raw image as input and generates a caption y encoded as a sequence of 1-of-K vectors:

$$y = \{y_1, \ldots, y_C\}, \qquad y_i \in \mathbb{R}^K$$

where $K$ is the size of the vocabulary and $C$ is the length of the caption.

We extract $L$ annotation vectors from a CNN, each a $D$-dimensional vector corresponding to a part of the image:

$$a = \{a_1, \ldots, a_L\}, \qquad a_i \in \mathbb{R}^D$$

2.2 Decoder: LSTM

The LSTM generates one word at each time step, conditioned on its previous hidden state, the previously generated word, and a context vector (the image attention).

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t)$$

where $i_t, f_t, o_t, g_t$ are the input, forget, output and input-modulation gates, $E$ is the word-embedding matrix, and $T_{s,t}$ denotes a learned affine transformation from $\mathbb{R}^s$ to $\mathbb{R}^t$.
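As a rough illustration of this update, here is a minimal numpy sketch of one decoding step (not the authors' code); the dimensions `m`, `n`, `D` and the weight matrix `T` are placeholder values of my own choosing, and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, n, D = 512, 1000, 512                 # embedding, hidden, feature dims (assumed)
rng = np.random.default_rng(0)
T = rng.standard_normal((m + n + D, 4 * n)) * 0.01   # affine map on [E y_{t-1}; h_{t-1}; z_hat]

def lstm_step(Ey_prev, h_prev, c_prev, z_hat):
    """One decoder step: gates from [E y_{t-1}; h_{t-1}; z_hat_t], then c_t and h_t."""
    x = np.concatenate([Ey_prev, h_prev, z_hat]) @ T
    i, f, o, g = np.split(x, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g               # memory update
    h = o * np.tanh(c)                   # new hidden state
    return h, c

h, c = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n), rng.standard_normal(D))
```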

$\hat{z}_t$ is the context vector. It captures the visual information associated with a particular input location and is a dynamic representation of the relevant part of the image as time passes. More specifically, $\hat{z}_t$ is computed from the $L$ annotation vectors; for each vector $a_i$ we assign a weight $\alpha_{ti}$ representing its relative importance for the current context vector:

$$e_{ti} = f_{\text{att}}(a_i, h_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L}\exp(e_{tk})}$$

$$\hat{z}_t = \phi(\{a_i\}, \{\alpha_{ti}\})$$

Here $f_{\text{att}}$ is a multilayer perceptron conditioned on the previous hidden state $h_{t-1}$, and $\phi$ returns a single context vector from the annotation vectors and their weights; its two instantiations are described in Sections 2.3 and 2.4.
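A minimal numpy sketch of computing the attention weights and a soft context vector; the two-layer scoring MLP standing in for $f_{\text{att}}$ and all sizes and weights are illustrative, not the paper's.

```python
import numpy as np

L, D, n = 196, 512, 1000          # assumed sizes: L regions, D-dim features, n-dim LSTM state
rng = np.random.default_rng(0)

a = rng.standard_normal((L, D))   # annotation vectors a_1..a_L
h_prev = rng.standard_normal(n)   # previous LSTM hidden state h_{t-1}

# A small MLP scoring each location against the hidden state (illustrative weights)
W_a = rng.standard_normal((D, 512)) * 0.01
W_h = rng.standard_normal((n, 512)) * 0.01
v   = rng.standard_normal(512) * 0.01

e = np.tanh(a @ W_a + h_prev @ W_h) @ v            # e_{t,i}, one score per location, shape (L,)
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax over locations -> alpha_{t,i}

z_hat = alpha @ a                                  # soft context vector, shape (D,)
```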

We use a deep output layer to compute the output word probability given the LSTM state, the context vector, and the previous word:

$$p(y_t \mid a, y_1^{t-1}) \propto \exp\!\big(L_o\,(E y_{t-1} + L_h h_t + L_z \hat{z}_t)\big)$$

where $L_o \in \mathbb{R}^{K\times m}$, $L_h \in \mathbb{R}^{m\times n}$, $L_z \in \mathbb{R}^{m\times D}$ and $E$ are learned parameters, $n$ is the dimension of the LSTM hidden state, and $m$ is the word-embedding dimension.
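A numpy sketch of this deep output layer under the assumed dimensions above; the random matrices stand in for the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, n, D = 10000, 512, 1000, 512        # vocab, embedding, hidden, feature dims (assumed)

# Stand-ins for the learned parameters L_o, L_h, L_z and the embedding matrix E
L_o = rng.standard_normal((K, m)) * 0.01
L_h = rng.standard_normal((m, n)) * 0.01
L_z = rng.standard_normal((m, D)) * 0.01
E   = rng.standard_normal((m, K)) * 0.01

def word_distribution(y_prev_onehot, h_t, z_hat_t):
    """p(y_t | a, y_1..t-1) = softmax(L_o (E y_{t-1} + L_h h_t + L_z z_hat_t))."""
    logits = L_o @ (E @ y_prev_onehot + L_h @ h_t + L_z @ z_hat_t)
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

y_prev = np.zeros(K); y_prev[3] = 1.0     # previous word as a 1-of-K vector
p_t = word_distribution(y_prev, rng.standard_normal(n), rng.standard_normal(D))
```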

 

2.3 Stochastic Hard Attention

$$p(s_{t,i}=1 \mid s_{j<t}, a) = \alpha_{t,i}, \qquad \hat{z}_t = \sum_{i} s_{t,i}\, a_i$$

$s_{t,i}$ is an indicator one-hot variable which is set to 1 if the $i$-th location (out of $L$) is the one used to extract visual features at time $t$.

$$L_s = \sum_{s} p(s\mid a)\,\log p(y\mid s,a) \;\le\; \log \sum_{s} p(s\mid a)\, p(y\mid s,a) = \log p(y\mid a)$$

$L_s$ is the objective function to be optimized: a variational lower bound on the marginal log-likelihood $\log p(y\mid a)$. Its gradient with respect to the model parameters $W$ is

$$\frac{\partial L_s}{\partial W} = \sum_{s} p(s\mid a)\left[\frac{\partial \log p(y\mid s,a)}{\partial W} + \log p(y\mid s,a)\,\frac{\partial \log p(s\mid a)}{\partial W}\right]$$

This gradient is approximated by Monte Carlo sampling: the location $\tilde{s}_t$ is drawn from a multinoulli distribution parameterized by $\alpha_t$.

$$\tilde{s}^n \sim \text{Multinoulli}_L(\{\alpha_i\}), \qquad \frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(y\mid \tilde{s}^n,a)}{\partial W} + \log p(y\mid \tilde{s}^n,a)\,\frac{\partial \log p(\tilde{s}^n\mid a)}{\partial W}\right]$$

To reduce the variance of this estimator, a moving-average baseline is maintained across minibatches,

$$b_k = 0.9\, b_{k-1} + 0.1\, \log p(y\mid \tilde{s}_k, a)$$

and an entropy term $H[\tilde{s}]$ is added, giving the final learning rule

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(y\mid \tilde{s}^n,a)}{\partial W} + \lambda_r\big(\log p(y\mid \tilde{s}^n,a) - b\big)\frac{\partial \log p(\tilde{s}^n\mid a)}{\partial W} + \lambda_e\,\frac{\partial H[\tilde{s}^n]}{\partial W}\right]$$

In making a hard choice at every point, the function $\phi(\{a_i\},\{\alpha_{ti}\})$ returns a sampled $a_i$ at each time step, based upon a multinoulli distribution parameterized by $\alpha$.
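A numpy sketch of this hard-attention step: sample one location from a multinoulli (categorical) distribution over the $\alpha$'s, and form the REINFORCE-style weighting of the score-function gradient. The log-likelihood value, baseline, and $\lambda_r$ below are placeholder numbers, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 196, 512
a = rng.standard_normal((L, D))          # annotation vectors
alpha = rng.dirichlet(np.ones(L))        # attention weights alpha_{t,i} (stand-in values)

# Hard attention: draw one location s_t ~ Multinoulli_L(alpha) and use that
# single annotation vector as the context.
s_t = rng.choice(L, p=alpha)
z_hat = a[s_t]

# REINFORCE-style weighting: the gradient of log p(s~ | a) is scaled by how well
# the sampled locations explained the caption, minus a moving-average baseline b.
log_p_y = -42.0                          # placeholder for log p(y | s~, a) from the decoder
b = -45.0                                # running baseline b_k = 0.9 b_{k-1} + 0.1 log p(y | s~_k, a)
lambda_r = 1.0
reward_weight = lambda_r * (log_p_y - b) # multiplies d log p(s~ | a)/dW in the update

# Gradient of log p(s_t | a) w.r.t. the attention scores e (pre-softmax):
# d/de_i log alpha_{s_t} = 1[i == s_t] - alpha_i  (standard softmax score-function identity)
grad_log_ps_wrt_e = -alpha.copy()
grad_log_ps_wrt_e[s_t] += 1.0
grad_e = reward_weight * grad_log_ps_wrt_e
```

The $(\log p(y\mid \tilde{s},a) - b)$ factor plays the role of the REINFORCE reward: sampling the location makes the objective non-differentiable, which is why this score-function estimator is needed.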

 

2.4 Deterministic Soft Attention

Instead of sampling a location, soft attention takes the expectation of the context vector directly:

$$\mathbb{E}_{p(s_t\mid a)}[\hat{z}_t] = \sum_{i=1}^{L}\alpha_{t,i}\,a_i, \qquad \phi(\{a_i\},\{\alpha_{ti}\}) = \sum_{i=1}^{L}\alpha_{ti}\,a_i$$

which makes the whole model smooth and differentiable, so it can be trained end-to-end with standard back-propagation. This corresponds to computing the normalized weighted geometric mean (NWGM) of the word distribution over the attention locations:

$$\text{NWGM}[p(y_t=k\mid a)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1\mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1\mid a)}} = \frac{\exp\big(\mathbb{E}_{p(s_t\mid a)}[n_{t,k}]\big)}{\sum_j \exp\big(\mathbb{E}_{p(s_t\mid a)}[n_{t,j}]\big)}$$

where $n_t = L_o(E y_{t-1} + L_h h_t + L_z \hat{z}_t)$, so the deterministic model approximately maximizes the marginal likelihood over attention locations. The context vector is additionally scaled by a gating scalar predicted from the previous hidden state:

$$\beta_t = \sigma\big(f_\beta(h_{t-1})\big), \qquad \phi(\{a_i\},\{\alpha_{ti}\}) = \beta_t \sum_{i=1}^{L}\alpha_{ti}\,a_i$$

Finally, a doubly stochastic regularization encourages $\sum_t \alpha_{ti} \approx 1$, so the model attends to every part of the image over the course of generation, and the model is trained by minimizing the penalized negative log-likelihood:

$$L_d = -\log p(y\mid x) + \lambda \sum_{i=1}^{L}\Big(1 - \sum_{t=1}^{C}\alpha_{ti}\Big)^2$$
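A numpy sketch of the soft-attention context vector with the gating scalar and the doubly stochastic penalty; the attention weights, $\beta_t$, and $\lambda$ below are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, C = 196, 512, 12                    # regions, feature dim, caption length (assumed)
a = rng.standard_normal((L, D))

# Suppose alphas[t] holds the attention weights produced at decoding step t.
alphas = rng.dirichlet(np.ones(L), size=C)         # shape (C, L), each row sums to 1

# Soft attention: the context is the expectation of z_hat under the attention
# distribution, scaled by a gating scalar beta_t = sigmoid(f_beta(h_{t-1})).
beta_t = 0.7                                        # stand-in for sigmoid(f_beta(h_{t-1}))
z_hat_t = beta_t * (alphas[0] @ a)                  # expected context at step t = 0, shape (D,)

# Doubly stochastic regularization: push sum_t alpha_{t,i} toward 1 for every location i.
lam = 1.0
penalty = lam * np.sum((1.0 - alphas.sum(axis=0)) ** 2)
# Training minimizes  L_d = -log p(y | x) + penalty
```

Because every quantity here is a smooth function of $\alpha$, gradients flow through the context vector and the penalty with ordinary back-propagation, which is the practical advantage over the hard variant.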

2.5 Training Procedure

Optimization method: RMSProp and Adam.

How do we get the $a_i$? Use VGGnet: the 14×14×512 feature map of the fourth convolutional layer, before max-pooling.

This means the decoder operates on the flattened 196×512 ($L \times D$) annotation matrix.
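A sketch of this feature extraction, assuming torchvision's VGG19; the network is left untrained here just to demonstrate the shapes, and the exact layer choice is my approximation of the paper's setup.

```python
import torch
from torchvision import models

# Take the convolutional feature map before the final max-pool (512 x 14 x 14 for a
# 224x224 input) and flatten it into L = 196 annotation vectors of dimension D = 512.
vgg = models.vgg19()
encoder = vgg.features[:-1]               # drop the last max-pool layer

image = torch.randn(1, 3, 224, 224)       # a dummy 224x224 RGB image
with torch.no_grad():
    fmap = encoder(image)                 # shape (1, 512, 14, 14)

annotations = fmap.flatten(2).transpose(1, 2)   # shape (1, 196, 512) = (batch, L, D)
print(annotations.shape)
```

Using a convolutional feature map rather than a fully-connected "CNN code" is what gives the decoder spatial locations to attend over.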