Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

1 Introduction

Previously, image captioning models could only be trained on paired image-sentence corpora. To address this limitation, the authors propose the Deep Compositional Captioner (DCC), which can generate descriptions of objects that do not appear in any paired corpus but are present in unpaired image data (object recognition datasets) and unpaired text data.

f:id:PDFangeltop1:20160201171229p:plain

First, a lexical classifier and a language model are trained on the unpaired image and text data, then combined into a deep caption model that is trained on the paired dataset. Second, in the multimodal layer, knowledge about objects seen in the paired data is transferred to new objects that appear only in the unpaired data.

There are two main approaches to image captioning.

(1) CNN-RNN framework: high-level features are extracted from the image by a CNN trained for image classification (e.g. VGGNet); a recurrent network then learns to predict the sentence conditioned on the image features and the previously predicted words (a minimal sketch follows after point (2)).

(2) Multimodal space: recurrent language features and image features are embedded into a common multimodal space, which is then used to predict the caption word by word.
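A minimal sketch of approach (1), assuming a precomputed CNN feature vector and toy dimensions; the layer names and sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CNNRNNCaptioner(nn.Module):
    """Toy CNN-RNN captioner: an LSTM predicts the next word
    conditioned on a fixed image feature and the previous words."""
    def __init__(self, feat_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=1000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # project CNN feature
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # next-word scores

    def forward(self, img_feat, prev_words):
        # img_feat: (B, feat_dim), prev_words: (B, T) word indices
        img = self.img_proj(img_feat).unsqueeze(1)        # (B, 1, embed_dim)
        img = img.expand(-1, prev_words.size(1), -1)      # repeat per time step
        words = self.embed(prev_words)                    # (B, T, embed_dim)
        h, _ = self.lstm(torch.cat([words, img], dim=-1))
        return self.out(h)                                # (B, T, vocab_size)
```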

Zero-shot learning: images are mapped to semantic word vectors corresponding to their classes, and the resulting image embeddings are used to detect and distinguish between unseen and seen classes.

The key idea is to transfer information from weights that are trained on image-sentence data to weights that are trained only on text data.

 

 2 Deep Compositional Captioner

2.1 Deep Lexical Classifier

A CNN that maps images to semantic concepts. Concepts (adjectives, verbs, and nouns) are mined from the paired image-text data. The lexical classifier is trained by fine-tuning a pretrained CNN. The output is fi, where each index of fi corresponds to the probability that a particular concept is present in the image, so each image is effectively tagged with multiple labels.
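A rough sketch of such a multi-label lexical classifier, assuming a recent torchvision and a hypothetical concept count; the paper's exact backbone and head may differ:

```python
import torch
import torch.nn as nn
from torchvision import models

num_concepts = 500  # hypothetical number of mined adjectives/verbs/nouns

# Fine-tune a pretrained CNN; replace the classification head with one
# output per concept (multi-label, so sigmoids rather than a softmax).
backbone = models.vgg16(weights="IMAGENET1K_V1")
backbone.classifier[6] = nn.Linear(4096, num_concepts)

# During fine-tuning, raw logits go into a per-concept binary loss.
criterion = nn.BCEWithLogitsLoss()

def lexical_features(images):
    """f_i: per-concept probabilities that each concept appears in the image."""
    return torch.sigmoid(backbone(images))
```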

2.2 Language Model

An LSTM trained on the unpaired text data to predict the next word given the previous words; its hidden representation provides the language features fl used by the caption model.
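A minimal sketch of such an LSTM language model with placeholder sizes; the hidden state plays the role of fl:

```python
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Predicts the next word given the previous words; the hidden
    state serves as the language feature f_l for the caption model."""
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_words):
        h, _ = self.lstm(self.embed(prev_words))  # h: (B, T, hidden_dim) = f_l
        return self.out(h), h                     # next-word logits and f_l
```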

2.3 Caption Model

f:id:PDFangeltop1:20160201194725p:plain

f:id:PDFangeltop1:20160201194600p:plain

Both the language model and the caption model are trained to predict a sequence of words, whereas the lexical classifier is trained to predict a fixed set of candidate visual elements for a given image.
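As a rough sketch (my interpretation, not the authors' released code), the caption model's multimodal layer can be read as scoring every vocabulary word from the two feature streams, matching the fiWi + flWl + b form used in the transfer section below:

```python
import torch
import torch.nn as nn

class MultimodalWordScorer(nn.Module):
    """Scores each vocabulary word as f_i @ W_i + f_l @ W_l + b."""
    def __init__(self, num_concepts, lang_dim, vocab_size):
        super().__init__()
        self.W_i = nn.Parameter(torch.randn(num_concepts, vocab_size) * 0.01)
        self.W_l = nn.Parameter(torch.randn(lang_dim, vocab_size) * 0.01)
        self.b = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, f_i, f_l):
        # f_i: (B, num_concepts) lexical-classifier probabilities
        # f_l: (B, lang_dim) language-model features
        return f_i @ self.W_i + f_l @ self.W_l + self.b
```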

2.4 Transfer Learning

(1) Direct transfer:

For example, the word "sheep" is in the paired dataset, and "alpaca" is not.

The multimodal layer scores a word c by computing fiWi[:,c] + flWl[:,c] + b[c]. To transfer the knowledge the model captured about sheep to alpaca (a visually similar but unseen animal), the weights for the alpaca column ca (with lexical-classifier row ra) are set from the sheep column cs (row rs):

Wi[:,ca], Wl[:,ca], b[ca] = Wi[:,cs], Wl[:,cs], b[cs]

Wi[ra,ca] = Wi[rs,cs]

Wi[rs,ca] = Wi[ra,cs] = 0
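A small numpy sketch of this direct-transfer rule, using hypothetical index variables for the sheep/alpaca example (not the authors' code):

```python
import numpy as np

def direct_transfer(W_i, W_l, b, c_s, c_a, r_s, r_a):
    """Copy the multimodal weights of a known word (column c_s, image row r_s)
    to a novel word (column c_a, image row r_a), then decouple the two
    words' visual evidence."""
    W_i[:, c_a] = W_i[:, c_s]
    W_l[:, c_a] = W_l[:, c_s]
    b[c_a] = b[c_s]
    W_i[r_a, c_a] = W_i[r_s, c_s]  # the new word listens to its own classifier score
    W_i[r_s, c_a] = 0.0            # ...but not to the known word's score
    W_i[r_a, c_s] = 0.0            # and the known word ignores the new concept
    return W_i, W_l, b
```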

(2) Delta transfer:

f:id:PDFangeltop1:20160202113500p:plain

f:id:PDFangeltop1:20160202113503p:plain

Word2vec is used to determine word similarity, i.e., to find the known words closest to each new word.
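A small sketch of this similarity lookup, assuming word2vec vectors are already loaded into a plain dict (a hypothetical helper, not the paper's code):

```python
import numpy as np

def closest_known_words(new_word, known_words, word2vec, k=5):
    """Return the k known words whose word2vec embeddings have the
    highest cosine similarity to the new word's embedding."""
    v = word2vec[new_word]
    v = v / np.linalg.norm(v)
    sims = []
    for w in known_words:
        u = word2vec[w]
        sims.append((w, float(v @ u / np.linalg.norm(u))))
    sims.sort(key=lambda x: x[1], reverse=True)
    return sims[:k]
```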