Multimodal Convolutional Neural Network for Matching Image and Sentence (Accepted by ICCV 2015)

This paper provides an end-to-end framework for matching "image representation" and "word composition". More specifically, it consists of an image CNN that encodes the image content, and a matching CNN that learns the joint representation of image and sentence.

The paper also summarizes much strong work in the field of matching image and text, in two threads: "bidirectional image and sentence retrieval" and "automatic image captioning".

For bidirectional image and sentence retrieval, the papers are:

1 Framing image description as a ranking task: data, models and evaluation metrics.

2 Deep correlation for matching images and text. (CVPR2015)

3 Associating neural word embeddings with deep image representation using fisher vectors (CVPR 2015)

4 Skip-thought vectors.

5 Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.

For automatic image captioning,

1 Deep captioning with multimodal recurrent neural networks (ICLR 2015)

2 Explain images with multimodal recurrent neural networks

3 Multimodal neural language models. (ICML 2014)

4 Unifying visual-semantic embeddings with multimodal neural language models

5 Show and tell: a neural image caption generator

6 Long-term recurrent convolutional networks for visual recognition and description.

 

Three stages of the model

1 Image CNN:       f:id:PDFangeltop1:20151210094350p:plain

Get the CNN code using the VGG model (very deep convolutional networks).

The matrix W_im has shape d × 4096 (d = 256), so we finally obtain an image vector of dimension 256.
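The mapping above can be sketched as a single linear projection from the 4096-d VGG feature into the 256-d joint space. The nonlinearity and initialization here are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def embed_image(cnn_feature, W_im, b_im):
    """Project a 4096-d VGG code into the d = 256 joint space.

    A sketch of v_im = sigma(W_im @ CNN(I) + b_im); the choice of
    ReLU as the nonlinearity is an assumption.
    """
    return np.maximum(0.0, W_im @ cnn_feature + b_im)

d = 256
rng = np.random.default_rng(0)
W_im = rng.standard_normal((d, 4096)) * 0.01   # the d x 4096 matrix W_im
b_im = np.zeros(d)
feat = rng.standard_normal(4096)               # stand-in for a VGG fc code
v_im = embed_image(feat, W_im, b_im)           # 256-d image vector
```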

2 Matching CNN:

  2.1 Word-level matching CNN                f:id:PDFangeltop1:20151210095502p:plain

w(l,f) are the parameters for the f-th feature map on the l-th layer. f:id:PDFangeltop1:20151210095812p:plain

k_rp is the window size, set to 3.

 f:id:PDFangeltop1:20151210095235p:plain

The image vector is concatenated with every word-window vector.

 A trick: to handle variable lengths, the maximum sentence length is fixed.

 f:id:PDFangeltop1:20151210095248p:plain

And max-pooling: f:id:PDFangeltop1:20151210095410p:plain
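Putting the pieces above together, a minimal numpy sketch of one word-level matching layer: every size-3 word window is concatenated with the image vector, passed through a convolution filter, and then max-pooled. All shapes and the ReLU gating are illustrative assumptions:

```python
import numpy as np

def word_level_layer(words, v_im, W, b, k=3):
    """One word-level matching convolution, as sketched in this note.

    `words`: (T, d_w) word embeddings; `v_im`: (d,) image vector.
    Each size-k word window is concatenated with the image vector
    and mapped by the (hypothetical) filter W with a ReLU.
    """
    T = words.shape[0]
    outputs = []
    for t in range(T - k + 1):
        window = words[t:t + k].ravel()         # k * d_w window vector
        joint = np.concatenate([window, v_im])  # image joined to every window
        outputs.append(np.maximum(0.0, W @ joint + b))
    return np.stack(outputs)                    # (T - k + 1, n_feat)

def max_pool(feats, width=2):
    """Max-pooling over adjacent positions (stride = width)."""
    return np.stack([feats[i:i + width].max(axis=0)
                     for i in range(0, feats.shape[0] - width + 1, width)])

rng = np.random.default_rng(1)
d_w, d, n_feat, k = 50, 256, 200, 3
words = rng.standard_normal((8, d_w))           # 8-word sentence (fixed max length in the paper)
v_im = rng.standard_normal(d)
W = rng.standard_normal((n_feat, k * d_w + d)) * 0.01
b = np.zeros(n_feat)
pooled = max_pool(word_level_layer(words, v_im, W, b, k))
```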

  2.2 Phrase-level matching CNN

    f:id:PDFangeltop1:20151210095513p:plain

    Let the CNN work solely on words up to certain layers before interacting with the image.

  2.3 Sentence-level matching CNN

  f:id:PDFangeltop1:20151210095336p:plain

  2.4 Summing the scores of the four models together: MatchCNN_ens.

f:id:PDFangeltop1:20151210095426p:plain

 3 MLP: a two-layer multilayer perceptron produces the final matching score.
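The scoring head can be sketched as a plain two-layer MLP that maps the joint image-sentence representation to a scalar. The layer sizes and ReLU activation here are assumptions:

```python
import numpy as np

def match_score(joint, W1, b1, W2, b2):
    """Two-layer MLP scoring head, a sketch of stage 3.

    Maps the pooled joint representation to a scalar matching
    score; all dimensions here are illustrative.
    """
    h = np.maximum(0.0, W1 @ joint + b1)   # hidden layer with ReLU
    return float(W2 @ h + b2)              # scalar score s(x, y)

rng = np.random.default_rng(2)
joint = rng.standard_normal(400)           # joint representation (size assumed)
W1 = rng.standard_normal((128, 400)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal(128) * 0.01
b2 = 0.0
s = match_score(joint, W1, b1, W2, b2)
```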

 

Learning phase: using a ranking loss function.

f:id:PDFangeltop1:20151210101432p:plain

(x_n, y_n) is the correlated image-sentence pair; (x_n, y_m) are randomly sampled uncorrelated image-sentence pairs.

μ = 0.5, with early stopping and dropout (0.1).
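The ranking loss described above reduces to a one-line hinge: a correlated pair must beat an uncorrelated pair by at least the margin μ = 0.5:

```python
def ranking_loss(s_pos, s_neg, mu=0.5):
    """Hinge ranking loss from this note: max(0, mu + s(x_n, y_m) - s(x_n, y_n)).

    `s_pos` scores the correlated pair (x_n, y_n); `s_neg` scores a
    randomly sampled uncorrelated pair (x_n, y_m); mu = 0.5 as above.
    """
    return max(0.0, mu + s_neg - s_pos)

# A matched pair scored well above a mismatched one incurs no loss:
assert ranking_loss(2.0, 0.5) == 0.0
# A margin violation is penalized linearly:
assert abs(ranking_loss(0.2, 0.1) - 0.4) < 1e-12
```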

 

Datasets: Flickr8K, Flickr30K, Microsoft COCO.