Deep Visual-Semantic Alignments for Generating Image Descriptions

1 Introduction  

This paper uses a CNN to embed image regions and a bidirectional RNN to embed sentences, maps both into a common multimodal space, and trains them with a structured objective that aligns the two modalities. It then proposes a multimodal RNN that learns to generate descriptions of image regions from the inferred image-sentence alignments.

 

2 Model Description

2.1 Learning to align visual data and language data

First, an RCNN detects candidate regions and the top 19 detections (plus the whole image) are kept; a pre-trained CNN then represents each region with a real-valued feature vector.

Each region's 4096-dimensional CNN feature is then projected into the multimodal embedding space by a matrix Wm of size h x 4096.
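A minimal, purely illustrative sketch of this projection step (random placeholder weights; the only detail taken from the notes above is that Wm is h x 4096):

import numpy as np

h = 1000                                  # size of the multimodal embedding space (assumed)
Wm = np.random.randn(h, 4096) * 0.01      # learned projection matrix, h x 4096
bm = np.zeros(h)

def embed_region(cnn_feature):
    # project a 4096-d CNN region feature into the h-dimensional multimodal space
    return Wm @ cnn_feature + bm

# e.g. the top 19 RCNN detections plus the whole image give 20 region vectors
regions = np.random.randn(20, 4096)
V = np.stack([embed_region(r) for r in regions])      # 20 x h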

2.2 Representing sentences

Sentences are encoded with a bidirectional recurrent neural network (BRNN), which computes a vector for every word that is enriched by the surrounding sentence context.


Word vectors are initialized with word2vec and kept fixed during training.
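A rough numpy sketch of the BRNN word encoder as I read it: each word vector is first mapped into an intermediate space, a forward and a backward pass accumulate context from both directions, and the two hidden states are combined into the final word representation s_t. The ReLU nonlinearity and the dimensions are assumptions; the inputs would be the fixed word2vec vectors mentioned above.

import numpy as np

h, d_word = 1000, 300                     # multimodal and word2vec dimensions (assumed)
relu = lambda x: np.maximum(0, x)

# learned parameters, randomly initialized here only to make the sketch run
We, be = np.random.randn(h, d_word) * 0.01, np.zeros(h)
Wf, bf = np.random.randn(h, h) * 0.01, np.zeros(h)
Wb, bb = np.random.randn(h, h) * 0.01, np.zeros(h)
Wd, bd = np.random.randn(h, h) * 0.01, np.zeros(h)

def encode_sentence(word_vectors):
    # word_vectors: fixed word2vec vectors, shape (T, d_word); returns (T, h)
    T = len(word_vectors)
    e = [relu(We @ x + be) for x in word_vectors]
    hf, hb = [np.zeros(h)] * T, [np.zeros(h)] * T
    for t in range(T):                                # forward direction
        prev = hf[t - 1] if t > 0 else np.zeros(h)
        hf[t] = relu(e[t] + Wf @ prev + bf)
    for t in reversed(range(T)):                      # backward direction
        nxt = hb[t + 1] if t < T - 1 else np.zeros(h)
        hb[t] = relu(e[t] + Wb @ nxt + bb)
    # combine both directions into the final word representation s_t
    return np.stack([relu(Wd @ (hf[t] + hb[t]) + bd) for t in range(T)])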

2.3 Alignment objective

The objective encourages aligned image-sentence pairs to have a higher score than misaligned pairs by a margin.


gk denotes the set of image regions in image k, and gl the set of sentence fragments (word vectors) in sentence l.

Skl then measures how well sentence l and image k are aligned.
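A sketch of how I read the score and the ranking objective: each word of sentence l contributes the dot product with its best-matching region of image k, and a max-margin loss ranks the true pairing Skk above both Skl and Slk. The margin value and everything else in the snippet are illustrative, not the authors' code.

import numpy as np

def image_sentence_score(V, S):
    # V: region vectors of one image, shape (M, h); S: word vectors of one sentence, shape (T, h)
    # each word contributes the score of its best-matching region
    sims = S @ V.T                       # (T, M) dot products v_i^T s_t
    return np.sum(np.max(sims, axis=1))

def ranking_loss(images, sentences, margin=1.0):
    # images[k] and sentences[k] form the k-th aligned pair; all other pairings are negatives
    K = len(images)
    scores = np.array([[image_sentence_score(images[k], sentences[l]) for l in range(K)]
                       for k in range(K)])
    loss = 0.0
    for k in range(K):
        for l in range(K):
            if l == k:
                continue
            loss += max(0.0, scores[k, l] - scores[k, k] + margin)   # rank sentences given image k
            loss += max(0.0, scores[l, k] - scores[k, k] + margin)   # rank images given sentence k
    return loss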

2.4 Decoding text segment alignments to images

The alignment is treated as a Markov random field over the latent word-to-region assignments: a unary term scores each word against its assigned region, and a pairwise term rewards adjacent words for choosing the same region.

(A question from these notes: why is the unary term not simply vt^T sj?)

(The sentence has N words and the image has M regions; a latent variable aj in {1, ..., M} assigns word j to a region.)

The aim is to align contiguous sequences of words to a single bounding box, rather than letting each word pick a region independently.

The best alignment is found by minimizing the energy with dynamic programming; the output is a set of image regions annotated with segments of text.
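A minimal sketch of a Viterbi-style dynamic program for this decoding, under the assumption that the objective is the sum of word-region affinities plus a bonus beta whenever adjacent words share a region (the sign convention and the value of beta are my assumptions):

import numpy as np

def decode_alignment(sims, beta=1.0):
    # sims: (N, M) affinities between the N words and M regions; returns one region index per word
    # maximizes sum_j sims[j, a_j] + beta * #{j : a_j == a_{j+1}} over assignments a
    N, M = sims.shape
    score = sims[0].copy()                     # best score of any path ending in region m at word 0
    back = np.zeros((N, M), dtype=int)
    for j in range(1, N):
        # trans[m, p] = score of moving from region p at word j-1 to region m at word j
        trans = score[None, :] + beta * np.eye(M)
        back[j] = np.argmax(trans, axis=1)
        score = sims[j] + np.max(trans, axis=1)
    a = np.zeros(N, dtype=int)
    a[-1] = int(np.argmax(score))
    for j in range(N - 1, 0, -1):              # backtrack the best assignment
        a[j - 1] = back[j, a[j]]
    return a

words_to_regions = decode_alignment(np.random.randn(8, 20))   # e.g. 8 words, 20 regions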

 

2.5 Multimodal Recurrent Neural Network for generating descriptions

 

The generator is a recurrent neural network conditioned on the image: it takes the CNN representation of an image or region as input and predicts the description one word at a time.
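A rough sketch of one step of such an image-conditioned generator, under the assumption that the image code enters only as an additive bias at the first time step and that the hidden state uses a ReLU (parameter names and sizes here are mine, not the paper's):

import numpy as np

h, d_word, vocab = 512, 300, 10000            # assumed sizes
relu = lambda x: np.maximum(0, x)

Whx = np.random.randn(h, d_word) * 0.01       # input word -> hidden
Whh = np.random.randn(h, h) * 0.01            # hidden -> hidden
Whi = np.random.randn(h, 4096) * 0.01         # image code -> hidden bias
Woh = np.random.randn(vocab, h) * 0.01        # hidden -> vocabulary logits
bh, bo = np.zeros(h), np.zeros(vocab)

def step(x_t, h_prev, bv=None):
    # one RNN step; bv is the image bias, passed in only at the first step
    pre = Whx @ x_t + Whh @ h_prev + bh
    if bv is not None:
        pre = pre + bv
    h_t = relu(pre)
    logits = Woh @ h_t + bo
    probs = np.exp(logits - logits.max())
    return h_t, probs / probs.sum()           # next hidden state and distribution over the next word

bv = Whi @ np.random.randn(4096)              # CNN code of the image or region
h_t, p = step(np.random.randn(d_word), np.zeros(h), bv)   # first step: START word plus image bias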

 

3 Limitations

The model can only generate a description for a single input array of pixels at a fixed resolution.

A more sensible approach might be to exploit the mutual interactions between regions and the wider image context before generating a description.

In addition, the approach consists of two separate models; going directly from an image-sentence dataset to region-level annotations within a single model trained end-to-end would be more desirable.