Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Have I mastered the various latent-variable machine learning models? Neural networks, RBMs, and the various probabilistic graphical models
The EM algorithm, variational inference, mean field, and the other solution methods built around latent variables
The convex optimization methods behind the above algorithms
Derive the formulas
1 Introduction
In the past, image captioning systems typically extracted image features from a fully-connected layer of a CNN (the so-called CNN codes). But rather than compressing an entire image into a single static representation, attention allows salient features to dynamically come to the forefront as needed. The authors introduce two attention-based image caption generators under a common framework: (a) a soft deterministic attention mechanism trainable by standard back-propagation, and (b) a hard stochastic attention mechanism trainable by maximizing an approximate variational lower bound or, equivalently, by REINFORCE.
2 Image Caption Generation with Attention Mechanism
2.1 Encoder: convolutional features
The model takes a raw image as input and generates a caption y encoded as a sequence of 1-of-K vectors.
We extract L feature vectors from the CNN, each of which is a D-dimensional vector corresponding to a part of the image.
2.2 Decoder: LSTM
The LSTM generates one word at each time step, conditioned on its hidden state, the previously generated word, and a context vector (the image attention).
z_t is the context vector, capturing the visual information associated with a particular input location; it is a dynamic representation of the relevant part of the image as time passes. More specifically, z_t is computed from the L feature vectors: each feature is assigned a coefficient representing its relative importance to the current context vector.
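The coefficient-and-weighted-sum step above can be sketched as follows. This is a minimal numpy sketch, assuming a single-hidden-layer MLP attention scorer; the parameter names (W_a, W_h, w) and their shapes are my assumptions, not the paper's notation.

```python
import numpy as np

def soft_attention(a, h, W_a, W_h, w):
    """Compute attention weights alpha and the context vector z.

    a   : (L, D) annotation vectors from the CNN
    h   : (n,)   previous LSTM hidden state
    W_a : (D, k), W_h : (n, k), w : (k,)  attention-MLP parameters
          (hypothetical shapes for illustration)
    """
    # Score each of the L locations against the current hidden state.
    e = np.tanh(a @ W_a + h @ W_h) @ w    # (L,)
    # Softmax over locations gives the relative-importance coefficients.
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                  # (L,), sums to 1
    # Context vector: importance-weighted sum of the annotation vectors.
    z = alpha @ a                         # (D,)
    return alpha, z
```

In the paper this scorer is the attention model f_att; the weighted sum corresponds to the soft (deterministic) variant.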
We use a deep output layer to compute the output word probability given the LSTM state, the context vector, and the previous word.
Here n is the dimension of the LSTM hidden state and m is the embedding dimension.
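A sketch of that deep output layer: the previous word's embedding, the hidden state, and the context vector are each projected into the m-dimensional embedding space, summed, and mapped to vocabulary logits. Parameter names follow the paper's L_o, L_h, L_z; the exact shapes here are my assumptions.

```python
import numpy as np

def word_probs(E_prev, h, z, L_o, L_h, L_z):
    """Deep output layer:
    p(y_t | a, y_{t-1}) ∝ exp(L_o (E y_{t-1} + L_h h_t + L_z z_t)).

    E_prev : (m,)   embedding of the previous word, E y_{t-1}
    h      : (n,)   LSTM hidden state
    z      : (D,)   context vector
    L_o    : (m, K), L_h : (n, m), L_z : (D, m)  output parameters
    """
    s = E_prev + h @ L_h + z @ L_z   # (m,) fused representation
    logits = s @ L_o                 # (K,) one score per vocabulary word
    p = np.exp(logits - logits.max())
    return p / p.sum()               # softmax over the K-word vocabulary
```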
2.3 Stochastic Hard Attention
s_{t,i} is an indicator one-hot variable that is set to 1 if the i-th location (out of L) is the one used to extract visual features.
The objective function L_s to be optimized is a variational lower bound on the marginal log-likelihood log p(y | a) of the observed words y given the image features a.
Equation (11) is a Monte Carlo based sampling approximation of the gradient with respect to the model parameters. This is done by sampling the location s_t from a multinoulli distribution.
In making a hard choice at every point, Equation (6) is a function that returns a sampled a_i at every point in time, based upon a multinoulli distribution parameterized by α.
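The sampling-based gradient above can be sketched with a score-function (REINFORCE-style) estimator. This is a simplified illustration, not the paper's full estimator (it omits the moving-average baseline and entropy term); `log_py_given_s` is a hypothetical stand-in for the decoder's log-likelihood.

```python
import numpy as np

def hard_attention_grad(e, log_py_given_s, n_samples=10, rng=None):
    """Monte Carlo estimate of the gradient of E[log p(y | s)]
    w.r.t. the unnormalized attention scores e.

    e              : (L,) attention scores over the L locations
    log_py_given_s : maps a sampled location index s_t to log p(y | s_t)
    """
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                     # multinoulli parameters alpha_i
    rng = rng or np.random.default_rng()
    grad = np.zeros_like(e)
    for _ in range(n_samples):
        s = rng.choice(len(alpha), p=alpha)  # sample s_t ~ Multinoulli(alpha)
        # Score function: d/de log alpha_s = one_hot(s) - alpha
        # (the gradient of a softmax log-probability).
        score = -alpha.copy()
        score[s] += 1.0
        grad += log_py_given_s(s) * score
    return grad / n_samples
```

Because each sample's score vector sums to zero, the estimated gradient components always sum to zero as well, which is a quick sanity check on the softmax algebra.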
2.4 Deterministic Soft Attention
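In the soft variant, instead of sampling a location, the model takes the expectation of the context vector under the attention distribution, which keeps the whole network smooth and differentiable so it can be trained end-to-end with standard back-propagation:

```latex
\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^{L} \alpha_{t,i} \, a_i
```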
2.5 Training Procedure
Optimization methods: RMSProp and Adam.
How do we get the a_i? Use VGGNet: take the fourth convolutional layer before max-pooling, which gives a 14×14×512 feature map.
This means the decoder operates on the flattened 196×512 (L×D) annotation matrix.
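The flattening step is just a reshape of the spatial grid into L annotation vectors; a minimal sketch (the feature map here is a zero placeholder standing in for real VGGNet activations):

```python
import numpy as np

# Placeholder for the 14x14x512 conv feature map
# (VGGNet's fourth conv layer before max-pooling).
feat = np.zeros((14, 14, 512))

# Flatten the 14x14 spatial grid into L = 196 annotation
# vectors, each of dimension D = 512.
annotations = feat.reshape(-1, 512)   # (L, D) = (196, 512)
```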