Generation and Comprehension of Unambiguous Object Descriptions

1 Introduction

The standard image captioning task suffers from a difficulty of evaluation: there is no convincing metric that can say one generated caption is definitively better than another. So instead of generating a description of the whole image, this work generates one for a specific region containing an object, such that the object can be disambiguated from other objects of the same kind.

The metric then becomes obvious: a referring expression is considered a good description only if it uniquely describes the relevant region, i.e., a listener can point out which part of the image the generated expression refers to.


There are two tasks:

(1) given the whole image and a sub-image (region), generate a description of the sub-image.

S* = argmax_S P(S|R,I), where R is the sub-image region, I is the whole image, and S is the generated description.

(2) given the whole image, a description, and several candidate sub-images, select the one most relevant to the description.

R* = argmax_{R' in C} P(S|R',I), where C is the set of candidate regions.
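
Concretely, comprehension can reuse the generation model as a scorer. A minimal sketch, assuming a hypothetical `sentence_log_prob(description, region, image)` that returns log P(S|R,I) under the trained generation model:

```python
def comprehend(description, candidate_regions, image, sentence_log_prob):
    # Score every candidate region by how well it explains the description,
    # then return the highest-scoring one (the argmax above).
    scores = [sentence_log_prob(description, r, image) for r in candidate_regions]
    return candidate_regions[max(range(len(scores)), key=scores.__getitem__)]
```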

In summary, the contributions are: (1) a new dataset of referring expressions; (2) a new method that jointly handles generation and comprehension.

 

2 Baseline Method

2.1 Compute P(S|R,I)

The sub-image and the whole image are each passed through VGGNet, producing a 1000-dimensional feature vector apiece. Five additional coordinate features encode the region's position and size, so in total we obtain a 2005-dimensional vector that is fed into the LSTM.


The LSTM uses a 1024-dimensional hidden state and 1024-dimensional word embeddings.
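
A rough sketch of the input construction, assuming `vgg1000` maps a single image or crop to a 1000-dim vector (e.g. VGGNet's last layer) and assuming the five coordinate features are the normalized box corners plus the region/image area ratio:

```python
import torch

def region_features(box, W, H):
    # five coordinate features: normalized corners + area ratio (assumed form)
    x_min, y_min, x_max, y_max = box
    area_ratio = (x_max - x_min) * (y_max - y_min) / (W * H)
    return torch.tensor([x_min / W, y_min / H, x_max / W, y_max / H, area_ratio])

def lstm_inputs(vgg1000, image, crop, box, W, H, word_embs):
    # 1000 (region) + 1000 (whole image) + 5 (coordinates) = 2005 dims
    ctx = torch.cat([vgg1000(crop), vgg1000(image), region_features(box, W, H)])
    # one common wiring: concatenate the context with each 1024-dim word
    # embedding and feed the result to a 1024-unit LSTM at every time step
    return torch.stack([torch.cat([ctx, w]) for w in word_embs])
```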

2.2 Maximum Likelihood Training

J(θ) = −Σ_{n=1..N} log P(S_n | R_n, I_n, θ)    (2)

i.e., minimize the negative log-likelihood of the descriptions over the N training triples (S_n, R_n, I_n).

Training uses batch size 16, learning rate 0.01, gradients clipped to a maximum norm of 10, and dropout rate 0.5.
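
A minimal PyTorch training-loop sketch with these hyperparameters; `model` and `batches` are assumed to be defined elsewhere, with `model(batch)` returning the batch's negative log-likelihood from formula (2) and dropout 0.5 applied inside the model:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for batch in batches:                     # batches of 16 examples
    loss = model(batch)                   # −log P(S|R,I), formula (2)
    optimizer.zero_grad()
    loss.backward()
    # clip gradients to a maximum norm of 10 before the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
```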

3 Full Method

Improvement: we want the description not only to have the highest probability given the correct sub-image, but also to have low probability given all the other sub-images. Formula (2) considers only the correct sub-image, so we account for the other sub-images by adding extra terms.

Softmax variant: maximize log P(R|S,I) = log [ P(S|R,I) / Σ_{R' in C} P(S|R',I) ], so the description must discriminate the true region from all candidates.

Max-margin variant: J'(θ) = −Σ_n { log P(S_n|R_n,I_n,θ) − λ · max(0, M − log P(S_n|R_n,I_n,θ) + log P(S_n|R'_n,I_n,θ)) }, where R'_n is a randomly sampled incorrect region, M is the margin, and λ weights the hinge term.

Because the max-margin loss compares two regions at a time, the network must be replicated twice (with shared weights) during training.
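
As a concrete reference, a minimal sketch of the max-margin term; `log_p_true` and `log_p_neg` are assumed to be log P(S|R,I) computed by the two weight-shared copies of the network for the true and the sampled negative region, and the default `lam`/`margin` values are illustrative, not from the paper:

```python
import torch

def max_margin_loss(log_p_true, log_p_neg, lam=1.0, margin=0.1):
    # hinge: penalize when the wrong region's log-probability comes
    # within `margin` of the true region's
    hinge = torch.clamp(margin - log_p_true + log_p_neg, min=0.0)
    # maximize log P(S|R,I) while pushing down P(S|R',I)
    return -(log_p_true - lam * hinge).mean()
```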

 

4 Evaluation

For comprehension, compute the Intersection over Union (IoU) ratio between the true and predicted bounding boxes; if the IoU exceeds 0.5, the prediction counts as correct.
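
The standard IoU computation, for boxes given as (x_min, y_min, x_max, y_max):

```python
def iou(a, b):
    # overlap along each axis (zero if the boxes are disjoint)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# e.g. iou(pred_box, true_box) > 0.5 counts as a correct comprehension
```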

For the generation task, use human evaluation via Amazon Mechanical Turk. Another approach is to feed the automatically generated sentences into the comprehension model and check whether they are decoded back to the original object of interest, as sketched below.
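
A sketch of that automatic check, reusing `iou` from above; `generate` and `comprehend` are assumed interfaces for the two models, and `examples` is assumed to yield (image, region, candidate_regions) triples:

```python
def cycle_accuracy(examples, generate, comprehend):
    # generate a description for each region, then see whether the
    # comprehension model maps it back to the original region
    hits = 0
    for image, region, candidates in examples:
        description = generate(region, image)
        predicted = comprehend(description, candidates, image)
        hits += iou(predicted, region) > 0.5
    return hits / len(examples)
```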