Multimodal Convolutional Neural Network for Matching Image and Sentence (Accepted by ICCV 2015)

This paper provides an end-to-end framework for matching "image representation" and "word composition". More specifically, it consists of an image CNN that encodes the image content, and a matching CNN that learns the joint representation of image and sentence.

The paper also summarizes much strong work in the field of matching image and text, in two threads: "bidirectional image and sentence retrieval" and "automatic image captioning".

For bidirectional image and sentence retrieval, the papers are:

1 Framing image description as a ranking task: data, models and evaluation metrics.

2 Deep correlation for matching images and text. (CVPR2015)

3 Associating neural word embeddings with deep image representation using fisher vectors (CVPR 2015)

4 Skip-thought vectors.

5 Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.

For automatic image captioning,

1 Deep captioning with multimodal recurrent neural networks (ICLR 2015)

2 Explain images with multimodal recurrent neural networks

3 Multimodal neural language models. (ICML 2014)

4 Unifying visual-semantic embeddings with multimodal neural language models

5 Show and tell: a neural image caption generator

6 Long-term recurrent convolutional networks for visual recognition and description.

 

Three stages of the model

1 Image CNN:       f:id:PDFangeltop1:20151210094350p:plain

Get the CNN code using the VGG model (very deep convolutional networks).

The matrix W_im has shape d × 4096 (d = 256), so we finally obtain an image vector of dimension 256.
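The mapping above can be sketched as a single linear projection from the 4096-d VGG feature into the 256-d joint space. The nonlinearity and initialization here are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def embed_image(cnn_feature, W_im, b_im):
    """Project a 4096-d VGG code into the d = 256 joint space.

    A sketch of v_im = sigma(W_im @ CNN(I) + b_im); the choice of
    ReLU as the nonlinearity is an assumption.
    """
    return np.maximum(0.0, W_im @ cnn_feature + b_im)

d = 256
rng = np.random.default_rng(0)
W_im = rng.standard_normal((d, 4096)) * 0.01   # the d x 4096 matrix W_im
b_im = np.zeros(d)
feat = rng.standard_normal(4096)               # stand-in for a VGG fc code
v_im = embed_image(feat, W_im, b_im)           # 256-d image vector
```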

2 Matching CNN:

  2.1 Word-level matching CNN                f:id:PDFangeltop1:20151210095502p:plain

w(l,f) are the parameters for the f-th feature map on the l-th layer. f:id:PDFangeltop1:20151210095812p:plain

k_rp is the window size, set to 3.

 f:id:PDFangeltop1:20151210095235p:plain

The image vector is concatenated with every word-window vector.

 A trick: to handle variable lengths, the maximum sentence length is fixed.

 f:id:PDFangeltop1:20151210095248p:plain

And max-pooling: f:id:PDFangeltop1:20151210095410p:plain
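Putting the pieces above together, a minimal numpy sketch of one word-level matching layer: every size-3 word window is concatenated with the image vector, passed through a convolution filter, and then max-pooled. All shapes and the ReLU gating are illustrative assumptions:

```python
import numpy as np

def word_level_layer(words, v_im, W, b, k=3):
    """One word-level matching convolution, as sketched in this note.

    `words`: (T, d_w) word embeddings; `v_im`: (d,) image vector.
    Each size-k word window is concatenated with the image vector
    and mapped by the (hypothetical) filter W with a ReLU.
    """
    T = words.shape[0]
    outputs = []
    for t in range(T - k + 1):
        window = words[t:t + k].ravel()         # k * d_w window vector
        joint = np.concatenate([window, v_im])  # image joined to every window
        outputs.append(np.maximum(0.0, W @ joint + b))
    return np.stack(outputs)                    # (T - k + 1, n_feat)

def max_pool(feats, width=2):
    """Max-pooling over adjacent positions (stride = width)."""
    return np.stack([feats[i:i + width].max(axis=0)
                     for i in range(0, feats.shape[0] - width + 1, width)])

rng = np.random.default_rng(1)
d_w, d, n_feat, k = 50, 256, 200, 3
words = rng.standard_normal((8, d_w))           # 8-word sentence (fixed max length in the paper)
v_im = rng.standard_normal(d)
W = rng.standard_normal((n_feat, k * d_w + d)) * 0.01
b = np.zeros(n_feat)
pooled = max_pool(word_level_layer(words, v_im, W, b, k))
```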

  2.2 Phrase-level matching CNN

    f:id:PDFangeltop1:20151210095513p:plain

    Let the CNN work solely on words up to certain layers before interacting with the image.

  2.3 Sentence-level matching CNN

  f:id:PDFangeltop1:20151210095336p:plain

  2.4 Summing the scores of the four models together: MatchCNN_ens.

f:id:PDFangeltop1:20151210095426p:plain

 3 MLP: a two-layer multilayer perceptron produces the final matching score.
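The scoring head can be sketched as a plain two-layer MLP that maps the joint image-sentence representation to a scalar. The layer sizes and ReLU activation here are assumptions:

```python
import numpy as np

def match_score(joint, W1, b1, W2, b2):
    """Two-layer MLP scoring head, a sketch of stage 3.

    Maps the pooled joint representation to a scalar matching
    score; all dimensions here are illustrative.
    """
    h = np.maximum(0.0, W1 @ joint + b1)   # hidden layer with ReLU
    return float(W2 @ h + b2)              # scalar score s(x, y)

rng = np.random.default_rng(2)
joint = rng.standard_normal(400)           # joint representation (size assumed)
W1 = rng.standard_normal((128, 400)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal(128) * 0.01
b2 = 0.0
s = match_score(joint, W1, b1, W2, b2)
```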

 

Learning phase: using a ranking loss function.

f:id:PDFangeltop1:20151210101432p:plain

(x_n, y_n) is the correlated image-sentence pair; (x_n, y_m) are randomly sampled uncorrelated image-sentence pairs.

μ = 0.5, with early stopping and dropout (0.1).
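The ranking loss described above reduces to a one-line hinge: a correlated pair must beat an uncorrelated pair by at least the margin μ = 0.5:

```python
def ranking_loss(s_pos, s_neg, mu=0.5):
    """Hinge ranking loss from this note: max(0, mu + s(x_n, y_m) - s(x_n, y_n)).

    `s_pos` scores the correlated pair (x_n, y_n); `s_neg` scores a
    randomly sampled uncorrelated pair (x_n, y_m); mu = 0.5 as above.
    """
    return max(0.0, mu + s_neg - s_pos)

# A matched pair scored well above a mismatched one incurs no loss:
assert ranking_loss(2.0, 0.5) == 0.0
# A margin violation is penalized linearly:
assert abs(ranking_loss(0.2, 0.1) - 0.4) < 1e-12
```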

 

Datasets: Flickr8K, Flickr30K, Microsoft COCO.