Paper Reading: Image Captioning with Semantic Attention

1 Introduction

The contribution of this paper is to propose an attention mechanism which is integrated in a RNN framework, over the visual attributes of the given image and use the attention over visual attributes as well as the information of the last time stamp, to generate information of the current time stamp. The author regards this proposal as a combination of the existing top-down approach , that is ,extracting features from image as a gist (a real-valued vector)and using it to generate caption by rnn, and bottom-up approach, that is ,extracting several discrete visual attributes from the image and build the sentence on these visual attributes.

f:id:PDFangeltop1:20160406161035p:plain

2 Model

f:id:PDFangeltop1:20160406163227p:plain

2.1 Overall framework

They first extract a feature of the whole image , served as a quick overview of the image content and also a set of attribute detectors each of which corresponds to an entry in the vocabulary,denoted as Ai.

f:id:PDFangeltop1:20160406164936p:plain

2.2 The input attention model

f:id:PDFangeltop1:20160406165401p:plain

f:id:PDFangeltop1:20160406165408p:plain

w models the relative importance of visual atrributes in each dimension of the word space.

2.3 Ouput attention model

f:id:PDFangeltop1:20160406165707p:plain

f:id:PDFangeltop1:20160406165719p:plain

V is the bilinear parameter matrix, activation function is used to ensure the same nolinear transoform is applied to the two feature vectors before they are computed.

2.4 Model learning

f:id:PDFangeltop1:20160406170021p:plain

f:id:PDFangeltop1:20160406170030p:plain

p>1 penalize execssive attention of a specific attribute accumulated over the entire sentence, while q<1 penalize diverted attention to multiple attributes at any particular time.

2.5 Visual attribute prediction

They build a set of fixed visual attributes by selecting the most common words from the captions in the training data. The resulting attributes are treated as a set of predefined categories and be learned by as a conventional classification problem.