Deep Visual-Semantic Alignments for Generating Image Descriptions

1 Introduction  

This paper uses a CNN to embed image regions and a bidirectional RNN to embed sentences, maps both into a common multimodal space, and trains them with a structured objective that aligns the two modalities. It then proposes a multimodal RNN that learns to generate descriptions of image regions from the inferred image-sentence alignments.

 

2 Model Description

2.1 Learning to align visual data and language data

First, an RCNN detects candidate regions and the top 19 detections (plus the whole image) are kept; a pre-trained CNN then represents each region with a real-valued feature vector.

Each region's 4096-dimensional CNN feature is then projected into the multimodal embedding space by a matrix Wm of size h x 4096.
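A minimal, purely illustrative sketch of this projection step (random placeholder weights; the only detail taken from the notes above is that Wm is h x 4096):

import numpy as np

h = 1000                                  # size of the multimodal embedding space (assumed)
Wm = np.random.randn(h, 4096) * 0.01      # learned projection matrix, h x 4096
bm = np.zeros(h)

def embed_region(cnn_feature):
    # project a 4096-d CNN region feature into the h-dimensional multimodal space
    return Wm @ cnn_feature + bm

# e.g. the top 19 RCNN detections plus the whole image give 20 region vectors
regions = np.random.randn(20, 4096)
V = np.stack([embed_region(r) for r in regions])      # 20 x h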

2.2 Representing sentences

Sentences are encoded with a bidirectional recurrent neural network (BRNN), which computes a vector for every word that is enriched by the surrounding sentence context.


Word vectors are initialized with word2vec and kept fixed during training.
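A rough numpy sketch of the BRNN word encoder as I read it: each word vector is first mapped into an intermediate space, a forward and a backward pass accumulate context from both directions, and the two hidden states are combined into the final word representation s_t. The ReLU nonlinearity and the dimensions are assumptions; the inputs would be the fixed word2vec vectors mentioned above.

import numpy as np

h, d_word = 1000, 300                     # multimodal and word2vec dimensions (assumed)
relu = lambda x: np.maximum(0, x)

# learned parameters, randomly initialized here only to make the sketch run
We, be = np.random.randn(h, d_word) * 0.01, np.zeros(h)
Wf, bf = np.random.randn(h, h) * 0.01, np.zeros(h)
Wb, bb = np.random.randn(h, h) * 0.01, np.zeros(h)
Wd, bd = np.random.randn(h, h) * 0.01, np.zeros(h)

def encode_sentence(word_vectors):
    # word_vectors: fixed word2vec vectors, shape (T, d_word); returns (T, h)
    T = len(word_vectors)
    e = [relu(We @ x + be) for x in word_vectors]
    hf, hb = [np.zeros(h)] * T, [np.zeros(h)] * T
    for t in range(T):                                # forward direction
        prev = hf[t - 1] if t > 0 else np.zeros(h)
        hf[t] = relu(e[t] + Wf @ prev + bf)
    for t in reversed(range(T)):                      # backward direction
        nxt = hb[t + 1] if t < T - 1 else np.zeros(h)
        hb[t] = relu(e[t] + Wb @ nxt + bb)
    # combine both directions into the final word representation s_t
    return np.stack([relu(Wd @ (hf[t] + hb[t]) + bd) for t in range(T)])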

2.3 Alignment objective

The objective encourages aligned image-sentence pairs to have a higher score than misaligned pairs by a margin.


gk denotes the set of image regions in image k, and gl the set of sentence fragments (word vectors) in sentence l.

Skl then measures how well sentence l and image k are aligned.
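A sketch of how I read the score and the ranking objective: each word of sentence l contributes the dot product with its best-matching region of image k, and a max-margin loss ranks the true pairing Skk above both Skl and Slk. The margin value and everything else in the snippet are illustrative, not the authors' code.

import numpy as np

def image_sentence_score(V, S):
    # V: region vectors of one image, shape (M, h); S: word vectors of one sentence, shape (T, h)
    # each word contributes the score of its best-matching region
    sims = S @ V.T                       # (T, M) dot products v_i^T s_t
    return np.sum(np.max(sims, axis=1))

def ranking_loss(images, sentences, margin=1.0):
    # images[k] and sentences[k] form the k-th aligned pair; all other pairings are negatives
    K = len(images)
    scores = np.array([[image_sentence_score(images[k], sentences[l]) for l in range(K)]
                       for k in range(K)])
    loss = 0.0
    for k in range(K):
        for l in range(K):
            if l == k:
                continue
            loss += max(0.0, scores[k, l] - scores[k, k] + margin)   # rank sentences given image k
            loss += max(0.0, scores[l, k] - scores[k, k] + margin)   # rank images given sentence k
    return loss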

2.4 Decoding text segment alignments to images

The alignment is treated as a Markov random field over the latent word-to-region assignments: a unary term scores each word against its assigned region, and a pairwise term rewards adjacent words for choosing the same region.

(A question from these notes: why is the unary term not simply vt^T sj?)

(The sentence has N words and the image has M regions; a latent variable aj in {1, ..., M} assigns word j to a region.)

The aim is to align contiguous sequences of words to a single bounding box, rather than letting each word pick a region independently.

The best alignment is found by minimizing the energy with dynamic programming; the output is a set of image regions annotated with segments of text.
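A minimal sketch of a Viterbi-style dynamic program for this decoding, under the assumption that the objective is the sum of word-region affinities plus a bonus beta whenever adjacent words share a region (the sign convention and the value of beta are my assumptions):

import numpy as np

def decode_alignment(sims, beta=1.0):
    # sims: (N, M) affinities between the N words and M regions; returns one region index per word
    # maximizes sum_j sims[j, a_j] + beta * #{j : a_j == a_{j+1}} over assignments a
    N, M = sims.shape
    score = sims[0].copy()                     # best score of any path ending in region m at word 0
    back = np.zeros((N, M), dtype=int)
    for j in range(1, N):
        # trans[m, p] = score of moving from region p at word j-1 to region m at word j
        trans = score[None, :] + beta * np.eye(M)
        back[j] = np.argmax(trans, axis=1)
        score = sims[j] + np.max(trans, axis=1)
    a = np.zeros(N, dtype=int)
    a[-1] = int(np.argmax(score))
    for j in range(N - 1, 0, -1):              # backtrack the best assignment
        a[j - 1] = back[j, a[j]]
    return a

words_to_regions = decode_alignment(np.random.randn(8, 20))   # e.g. 8 words, 20 regions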

 

2.5 Multimodal Recurrent Neural Network for generating descriptions

 

The generator is a recurrent neural network conditioned on the image: it takes the CNN representation of an image or region as input and predicts the description one word at a time.
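A rough sketch of one step of such an image-conditioned generator, under the assumption that the image code enters only as an additive bias at the first time step and that the hidden state uses a ReLU (parameter names and sizes here are mine, not the paper's):

import numpy as np

h, d_word, vocab = 512, 300, 10000            # assumed sizes
relu = lambda x: np.maximum(0, x)

Whx = np.random.randn(h, d_word) * 0.01       # input word -> hidden
Whh = np.random.randn(h, h) * 0.01            # hidden -> hidden
Whi = np.random.randn(h, 4096) * 0.01         # image code -> hidden bias
Woh = np.random.randn(vocab, h) * 0.01        # hidden -> vocabulary logits
bh, bo = np.zeros(h), np.zeros(vocab)

def step(x_t, h_prev, bv=None):
    # one RNN step; bv is the image bias, passed in only at the first step
    pre = Whx @ x_t + Whh @ h_prev + bh
    if bv is not None:
        pre = pre + bv
    h_t = relu(pre)
    logits = Woh @ h_t + bo
    probs = np.exp(logits - logits.max())
    return h_t, probs / probs.sum()           # next hidden state and distribution over the next word

bv = Whi @ np.random.randn(4096)              # CNN code of the image or region
h_t, p = step(np.random.randn(d_word), np.zeros(h), bv)   # first step: START word plus image bias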

 

3 Limitations

The model can only generate a description for a single input array of pixels at a fixed resolution.

A more sensible approach might be to exploit the mutual interactions between regions and the wider image context before generating a description.

In addition, the approach consists of two separate models; going directly from an image-sentence dataset to region-level annotations within a single model trained end-to-end would be more desirable.