Generation and Comprehension of Unambiguous Object Descriptions

1 Introduction

The standard image captioning task suffers from the difficulty of evaluation: there is no convincing metric that can say one generated caption is definitively better than another. So this work does not generate a description from the whole image, but from a specific region, producing a description of an object that disambiguates it from other objects of the same kind.

The metric then becomes obvious: a referring expression is a good description only if it uniquely describes the relevant region, i.e., a listener can point out which part of the image the generated expression refers to.


There are two tasks:

(1) Generation: given the whole image and a sub-image (region), generate a description of the sub-image.

Here R is the sub-image region, I is the whole image, and S is the generated description.

(2) Comprehension: given the whole image, a description, and several candidate sub-images, select the most relevant one. (Both tasks are written out as formulas below.)

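As a rough reconstruction of the formulas shown in the figures (my notation, following the definitions above), the two tasks can be written as

$$S^{*} = \arg\max_{S} \, p(S \mid R, I), \qquad R^{*} = \arg\max_{R \in \mathcal{C}} \, p(S \mid R, I),$$

where \mathcal{C} is the set of candidate regions.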

In summary, the contributions are: (1) a new dataset; (2) a new method for joint generation and comprehension.

 

2 Baseline Method

2.1 Compute P(S|R,I)

The sub-image and the whole image are each fed into VGGNet, giving a 1000-dimensional vector for each. Together with another 5 coordinate-related features of the region, we obtain a 2005-dimensional vector that is fed into the LSTM at each step of the sequence.
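A minimal sketch of how such an input vector could be assembled (the helper name vgg_fc1000 and the exact five location features are assumptions for illustration, not taken from the paper):

    import numpy as np

    def region_input(image, box, vgg_fc1000):
        """Build the 2005-dim vector fed to the LSTM.

        image: HxWx3 array; box: (x1, y1, x2, y2); vgg_fc1000: callable
        returning the 1000-dim VGGNet output for an image crop (assumed helper).
        """
        H, W = image.shape[:2]
        x1, y1, x2, y2 = box
        whole_feat = vgg_fc1000(image)                  # 1000-dim, whole image
        region_feat = vgg_fc1000(image[y1:y2, x1:x2])   # 1000-dim, sub-image
        # 5 coordinate-related features: normalized corners + relative box area
        loc = np.array([x1 / W, y1 / H, x2 / W, y2 / H,
                        (x2 - x1) * (y2 - y1) / (W * H)])
        return np.concatenate([region_feat, whole_feat, loc])  # 2005 dims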


The LSTM uses a 1024-dimensional hidden state and 1024-dimensional word embeddings.

2.2 Maximum Likelihood Training

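The training objective in the figure should be the standard negative log-likelihood over the training triplets (this is the "formula (2)" referenced in Section 3 below); treat the exact form as a reconstruction from memory:

$$J(\theta) = -\sum_{n=1}^{N} \log p(S_n \mid R_n, I_n, \theta) \qquad (2)$$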

Training details: batch size 16, learning rate 0.01, gradients clipped to a maximum norm of 10, dropout rate 0.5.

3 Full Method

Improvement: we not only want the description to have the highest probability given the correct sub-image, but also want its probability to be small given all the other sub-images. Formula (2) only considers the correct sub-image, so other sub-images are brought in by adding extra terms.

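Reconstructed from memory of the paper (so the exact constants are assumptions), the max-margin version augments (2) with a hinge term that pushes down the probability of the description on a randomly chosen wrong region R'_n:

$$J'(\theta) = -\sum_{n=1}^{N} \Big[ \log p(S_n \mid R_n, I_n, \theta) - \lambda \, \max\big(0,\; M - \log p(S_n \mid R_n, I_n, \theta) + \log p(S_n \mid R'_n, I_n, \theta)\big) \Big],$$

where M is the margin and \lambda weights the contrastive term.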

Because the max-margin term compares two regions, the network is replicated twice (with shared weights) during training.

 

4 Evaluation

For comprehension, compute the Intersection over Union (IoU) ratio between the true and predicted bounding boxes; if the IoU exceeds 0.5, the prediction counts as correct.
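For concreteness, a minimal IoU check (the (x1, y1, x2, y2) box format is an assumption):

    def iou(box_a, box_b):
        """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    # a prediction counts as correct when iou(predicted_box, true_box) > 0.5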

For the generation task, use Amazon Mechanical Turk (human evaluation). Another approach is to pass the automatically generated sentences to the comprehension model and check whether they are decoded back to the original object of interest.


Paper Reading -> Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

1 Introduction      

This paper addresses the problem of generating descriptions from images. The difference from other work is that the authors propose a method to handle new concepts not seen in the training set. More specifically: once the model is trained on old image data, and new image data arrives that describes visual concepts not present in the old training data, how should the new data be handled? One option is to retrain the model from scratch and discard the old parameters; this works, but it wastes both time and computational resources. So the authors suggest adding some parameters corresponding to the new data and keeping the old parameters fixed while training the new ones.

The authors present a Novel Visual Concept learning from Sentences (NVCS) framework. First, the base model is trained on the old data using an m-RNN.


A transposed weight sharing (TWS) strategy converts the 1024-dimensional multimodal vector into a 512-dimensional one. The motivation is, first, to reduce the number of parameters (from N*1024 to N*512, where the vocabulary size N may exceed 100,000), and second, to make it easy to add parameters corresponding to new data.


The decoding matrix Um is discarded and replaced with Ui and Ud, where Ui is 1024*512 and Ud is shared (transposed) with the word-embedding matrix.

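As I understand TWS (the function names and the choice of nonlinearity below are my assumptions, not the paper's), the large N*1024 decoding matrix Um is replaced by the word matrix Ud applied on top of the smaller intermediate map Ui:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # m  : 1024-dim multimodal vector
    # Ui : (512, 1024) intermediate map (the "1024*512" matrix in the notes)
    # Ud : (N, 512) word matrix, one row per vocabulary word
    # Instead of a separate (N, 1024) decoding matrix Um, Ud is reused
    # (transposed weight sharing), so only Ui adds new parameters.
    def next_word_probs(m, Ui, Ud, f=np.tanh):  # the nonlinearity f is assumed
        h = f(Ui @ m)            # 512-dim intermediate vector
        return softmax(Ud @ h)   # one probability per vocabulary word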

For novel concept learning, the originally learned weights for the old data are fixed first.

More specifically, Ud = [Ud_old, Ud_new]; Ud_old is kept fixed and only Ud_new is trained.

Then the baseline probability is fixed (baseline probability fixation).

 


We set bn' to the average value of the elements of bo' and keep bn' fixed.

So Ud_old, bo', and bn' are all fixed, and only Ud_new is trained.
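A minimal sketch of this parameter handling (all sizes and names here are toy values for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    N_old, d, n_new = 1000, 512, 2              # toy vocabulary sizes

    Ud_old = rng.standard_normal((N_old, d))    # stands in for the frozen rows of Ud
    b_old = rng.standard_normal(N_old)          # stands in for the frozen biases bo'

    Ud_new = 0.01 * rng.standard_normal((n_new, d))  # rows for the novel words; the only trained part
    b_new = np.full(n_new, b_old.mean())             # baseline probability fixation:
                                                     # bn' = mean of bo', then kept fixed

    Ud = np.vstack([Ud_old, Ud_new])            # decoding matrix for the enlarged vocabulary
    b = np.concatenate([b_old, b_new])
    trainable_params = [Ud_new]                 # Ud_old, bo' and bn' all stay fixed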

2 Dataset and Evaluation

They use the MSCOCO dataset, hold out the objects "cat" and "motor" as the novel-concept data, and use the remainder as the base dataset. The relation between the train, validation, and test splits is shown below.

[figure: relation between the train/validation/test splits of the base and novel-concept sets]

They calculate BLEU and METEOR scores, which evaluate the overall quality of the generated sentences.

The TWS and Baseline Probability Fixation strategies are effective.


Not many training examples from the novel-concept (NC) training set are needed to reach reasonable scores.


With about 10-50 training images, the model achieves comparable performance.

 

Some papers about image captioning

1 Deep Visual-Semantic Alignments for Generating Image Descriptions (CVPR 2015)

2 Long-term recurrent convolutional networks for visual recognition and description (CVPR 2015)

3 Show and tell, a neural image caption generator.

4 Unifying visual-semantic embeddings with multimodal neural language models.

This work has published code on GitHub (github.com), and the author has also published other interesting work such as skip-thought vectors.

5 Explain images with multimodal recurrent neural networks

6 Show, Attend and Tell, neural image caption generation with visual attention. 

Also very interesting work; the hard-attention variant is trained with a Monte Carlo sampling method.

7 Mind's eye: a recurrent visual representation for image caption generation. cvpr2015 CMU

8 From captions to visual concepts and back.

9 Order-embeddings of images and language

10 Deep compositional captioning: describing novel object categories without paired training data.

 

 

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

1 Introduction:

This model works at a finer level: it embeds fragments of images (objects) and fragments of sentences into a common space. The paper states that using both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks.

 

2 Model Description 

Detect objects as image fragments and use sentence dependency tree relations as sentence fragments.

Images and sentences are embedded into a common space, and the parameters are trained such that true image-sentence pairs receive an inner product (interpreted as a score) higher than that of false image-sentence pairs by a margin.

2.1 Dependency tree relations as sentence fragments 

For every dependency triplet (R,w1,w2) :

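If I remember the paper correctly (so treat this as a reconstruction rather than a quote), the sentence fragment vector is obtained by concatenating the two word vectors and applying an affine map, whose weights depend on the relation R, followed by a nonlinearity:

$$s = f\big(W_R \, [\, W_e \mathbb{1}_{w_1} ;\; W_e \mathbb{1}_{w_2} \,] + b_R\big),$$

where \mathbb{1}_w is the one-hot indicator vector of word w and W_e is the word-embedding matrix mentioned below.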

The word-embedding matrix We is 400,000 * d and is kept fixed. The dimensionality of the fragment embedding s is cross-validated.

2.2 Object detections as image fragments

Objects are detected in the image with a Region CNN. The 19 detected locations plus the entire image are used as the image fragments, and the embedding vectors are computed from the pixels Ib inside each bounding box as follows.


CNN(Ib) returns the 4096-dimensional activations of the fully connected layer immediately before the classifier.
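As far as I recall, the formula in the figure is a simple affine map of those CNN activations into the common embedding space:

$$v = W_m \, \mathrm{CNN}(I_b) + b_m.$$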

 

2.3 Objective function

The objective combines a fragment alignment term and a global ranking term:

a) Fragment alignment objective: if a sentence contains a fragment, at least one of the boxes in the corresponding image should have a high score with this fragment.

The constant Kij normalizes the objective with respect to the number of positive and negative labels yij, where yij = 1 if vi and sj occur together in a corresponding image-sentence pair (and -1 otherwise).
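Reconstructed from memory (the exact constants may differ), the fragment alignment term is a hinge loss on the fragment inner-product scores:

$$C_F(\theta) = \sum_{i,j} \kappa_{ij} \, \max\big(0,\; 1 - y_{ij} \, v_i^{\top} s_j\big).$$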

b) Global ranking objective: ensures that the computed image-sentence similarities are consistent with the ground-truth annotation.

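Again from memory (so the indexing is approximate), the global term is a standard ranking loss over the image-sentence score matrix S, where S_kl is the pooled fragment score between image k and sentence l:

$$C_G(\theta) = \sum_{k} \Big[ \sum_{l \ne k} \max\big(0,\; S_{kl} - S_{kk} + \Delta\big) + \sum_{l \ne k} \max\big(0,\; S_{lk} - S_{kk} + \Delta\big) \Big].$$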

The margin Δ is cross-validated.

2.4 Optimization

SGD with batch size 100 and momentum 0.9, trained for 15 epochs; the R-CNN is fine-tuned after the first 10 epochs.

    

3 Conclusion

From a modeling perspective, sentences are only modeled as bags of relations, and the model is incapable of counting. In the future the authors want to extend the model to support counting, to reason about the spatial positions of objects, and to move beyond bags of fragments.

 

 

Multimodal Convolutional Neural Networks for Matching Image and Sentence (accepted by ICCV 2015)

This paper provides an end-to-end framework to match "image representation" and "word composition". More specifically, it consists of an image CNN encoding the image content and a matching CNN learning the joint representation of image and sentence.

The paper also summarizes much strong work in the field of matching images and text, grouped into "bidirectional image and sentence retrieval" and "automatic image captioning".

For bidirectional image and sentence retrieval, the papers are:

1 Framing image description as a ranking task: data, models and evaluation metrics...

2 Deep correlation for matching images and text. (CVPR2015)

3 Associating neural word embeddings with deep image representation using fisher vectors (CVPR 2015)

4 Skip-thought vector.

5 Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.

For automatic image captioning,

1 Deep captioning with multimodal recurrent neural networks (ICLR 2015)

2 Explain images with multimodal recurrent neural networks

3 Multimodal neural language model. (ICML 2014)

4 Unifying visual-semantic embeddings with multimodal neural language models

5 Show and tell: a neural image caption generator

6 Long-term recurrent convolutional networks for visual recognition and description.

 

Three stages of the model

1 Image CNN:

Get the CNN code using the VGG model (very deep convolutional networks).

The matrix Wim is d * 4096 with d = 256, so we finally get an image vector of dimension 256.
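In symbols (my paraphrase of the figure; the paper may add a bias or nonlinearity):

$$\nu_{im} = W_{im} \, \mathrm{CNN}(I), \qquad W_{im} \in \mathbb{R}^{256 \times 4096}.$$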

2 Matching CNN:

2.1 Word-level matching CNN

w(l,f) are the parameters for the f-th feature map on the l-th layer; krp is the window size, set to 3.


The image vector is concatenated with every word-window vector.

A trick: to handle variable lengths, the maximum sentence length is fixed.


Then max-pooling is applied.
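A minimal numpy sketch of the word-level matching idea as described above (the dimensions, the padding scheme, and the ReLU are assumptions for illustration):

    import numpy as np

    d_img, d_word, max_len, k = 256, 50, 30, 3   # assumed sizes; window size k = 3

    def word_level_match(image_vec, word_vecs, W, b):
        """One convolution layer mixing the image with each window of k words.

        image_vec: (d_img,) CNN code of the image
        word_vecs: (n, d_word) word vectors of the sentence, n <= max_len
        W: (n_filters, k * d_word + d_img), b: (n_filters,)
        """
        padded = np.zeros((max_len, word_vecs.shape[1]))   # fixed maximum length
        padded[:len(word_vecs)] = word_vecs

        feats = []
        for t in range(max_len - k + 1):
            window = padded[t:t + k].reshape(-1)           # k consecutive word vectors
            x = np.concatenate([window, image_vec])        # image joined to the window
            feats.append(np.maximum(W @ x + b, 0.0))       # one multimodal feature vector
        return np.stack(feats).max(axis=0)                 # max-pooling over positions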

2.2 Phrase-level matching CNN


Let the CNN work solely on words up to a certain level before interacting with the image.

  2.3 Sentence-level matching CNN


2.4 Ensemble: summing the scores of the four models together gives MatchCNNens.


3 MLP: a two-layer multilayer perceptron produces the final matching score.

 

Learning phase: a ranking loss function is used.


(xn, yn) is a correlated image-sentence pair; (xn, ym) are randomly sampled uncorrelated image-sentence pairs.
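The loss in the figure should be a standard margin ranking loss of roughly this form (assuming a higher score s means a better match; the margin is the u below):

$$L(\theta) = \sum_{n} \max\big(0,\; u + s_\theta(x_n, y_m) - s_\theta(x_n, y_n)\big).$$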

u = 0.5, with early stopping and dropout (0.1).

 

Datasets: Flickr8K, Flickr30K, Microsoft COCO.