Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

1 Introduction

This model works at a finer level than whole images and sentences: it embeds fragments of images (objects) and fragments of sentences into a common space. The paper states that reasoning at both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks.

2 Model Description

Detect objects as image fragments and use sentence dependency tree relations as sentence fragments.

Embed the images and sentences into a common space. The parameters are trained so that true image-sentence pairs have an inner product (interpreted as a score) higher than that of false image-sentence pairs by a margin.

2.1 Dependency tree relations as sentence fragments

For every dependency triplet (R, w1, w2):

f:id:PDFangeltop1:20151212194817p:plain

The word embedding matrix We has size 400,000 * d and is kept fixed. The embedding dimension s is cross-validated.
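The fragment computation can be sketched in plain Python. This is only an illustration, assuming the embedding has the form s = f(W_R [We w1; We w2] + b_R) with f a ReLU and one (W_R, b_R) pair per relation type. The two-word vocabulary, the sizes d and s, the relation name "amod" and the random weights are made-up stand-ins, not values from the paper.

```python
import random

def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def embed_relation(w1, w2, W_e, W_R, b_R):
    """Embed one dependency triplet (R, w1, w2); W_R and b_R belong to relation R."""
    x = W_e[w1] + W_e[w2]          # list concatenation of the two d-dim word vectors
    z = matvec(W_R, x)             # project the 2d-dim input to s dimensions
    return relu([zi + bi for zi, bi in zip(z, b_R)])

random.seed(0)
d, s = 4, 3                        # toy sizes (assumed)
vocab = {"dog": 0, "black": 1}     # toy vocabulary (assumed; the real one has 400,000 words)
# Fixed word vectors, one row per word.
W_e = [[random.gauss(0, 1) for _ in range(d)] for _ in vocab]
# One weight matrix and bias per relation type; "amod" is just an example name.
params = {
    "amod": ([[random.gauss(0, 1) for _ in range(2 * d)] for _ in range(s)],
             [0.0] * s),
}

W_R, b_R = params["amod"]
frag = embed_relation(vocab["dog"], vocab["black"], W_e, W_R, b_R)
```

Each triplet yields one s-dimensional, non-negative vector, so a sentence becomes a set of such vectors, one per dependency relation.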

2.2 Object detections as image fragments

Detect objects in the image with a Region CNN (RCNN). Use the 19 detected locations and the entire image as the image fragments, and compute the embedding vectors from the pixels Ib inside each bounding box as follows.

f:id:PDFangeltop1:20151212195015p:plain

CNN(Ib) returns the 4096-dimensional activations of the fully connected layer immediately before the classifier.
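As a rough sketch of this step, assume the embedding is a linear map v = W_m CNN(Ib) + b_m from the 4096-dimensional activations into the s-dimensional common space. The function fake_cnn below is a hypothetical stand-in that returns random features instead of running a real network, and s = 8 is a toy value.

```python
import random

random.seed(0)
CNN_DIM, s = 4096, 8       # 4096 comes from the text; s = 8 is a toy value
NUM_FRAGMENTS = 20         # 19 detected locations plus the entire image

def fake_cnn(box_id):
    """Hypothetical stand-in for CNN(Ib): returns a 4096-dimensional vector."""
    rng = random.Random(box_id)
    return [rng.random() for _ in range(CNN_DIM)]

# Projection into the common space (random toy weights).
W_m = [[random.gauss(0, 0.01) for _ in range(CNN_DIM)] for _ in range(s)]
b_m = [0.0] * s

def embed_box(features):
    """Linear map from CNN activations to an s-dimensional fragment vector."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(W_m, b_m)]

image_fragments = [embed_box(fake_cnn(i)) for i in range(NUM_FRAGMENTS)]
```

An image is thus represented by 20 vectors that live in the same space as the sentence fragments, which is what makes inner products between the two meaningful.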

2.3 Objective function

f:id:PDFangeltop1:20151212195846p:plain

a) Fragment Alignment objective: if a sentence contains a fragment, at least one of the boxes in the corresponding image should have a high score with this fragment.

f:id:PDFangeltop1:20151212200151p:plain

The constant Kij normalizes the objective with respect to the number of positive and negative yij, where yij = 1 if vi and sj occur together in a corresponding image-sentence pair and yij = -1 otherwise.
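A simplified version of this objective can be written as a hinge loss over all fragment pairs. The sketch assumes the per-pair term is max(0, 1 - yij * vi.sj) with labels of +1 and -1, and it leaves out the normalizing constant Kij (effectively setting it to 1), so it illustrates the idea rather than the paper's exact formula.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fragment_alignment_cost(V, S, Y):
    """Sum of hinge losses over all (image fragment, sentence fragment) pairs.

    V: image fragment vectors, S: sentence fragment vectors.
    Y[i][j] is +1 if V[i] and S[j] come from a matching image-sentence
    pair and -1 otherwise (assumed labeling).
    """
    cost = 0.0
    for i, v in enumerate(V):
        for j, s in enumerate(S):
            # Positive pairs are pushed above +1, negative pairs below -1.
            cost += max(0.0, 1.0 - Y[i][j] * dot(v, s))
    return cost

# Toy example with 2-dimensional fragments.
V = [[2.0, 0.0], [0.0, 1.0]]
S = [[1.0, 0.0], [0.0, -1.0]]
Y = [[1, -1], [-1, -1]]
cost = fragment_alignment_cost(V, S, Y)   # 2.0
```

In the toy example the positive pair already scores 2 and contributes nothing, while the two negative pairs with an inner product of 0 each contribute 1.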

b) Global Ranking objective: ensures that the computed image-sentence similarities are consistent with the ground truth annotation.

f:id:PDFangeltop1:20151212200513p:plain

f:id:PDFangeltop1:20151212200515p:plain

The margin delta is cross-validated.
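The ranking idea can be illustrated with a small sketch. It assumes that the score of an image-sentence pair is the average of the thresholded inner products between their fragments, and that each true pair should beat every false pair by the margin delta in both retrieval directions. The normalization of the score and the skipping of the k = l term are simplifications chosen here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pair_score(image_frags, sent_frags):
    """Image-sentence score: mean of thresholded fragment inner products."""
    total = sum(max(0.0, dot(v, s)) for v in image_frags for s in sent_frags)
    return total / (len(image_frags) * len(sent_frags))

def global_ranking_cost(images, sentences, delta):
    """images[k] and sentences[k] form the k-th true pair."""
    n = len(images)
    S = [[pair_score(images[k], sentences[l]) for l in range(n)]
         for k in range(n)]
    cost = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            cost += max(0.0, S[k][l] - S[k][k] + delta)  # rank sentences for image k
            cost += max(0.0, S[l][k] - S[k][k] + delta)  # rank images for sentence k
    return cost

# Two images and two sentences with one fragment each.
images = [[[1.0, 0.0]], [[0.0, 1.0]]]
sentences = [[[1.0, 0.0]], [[0.0, 1.0]]]
small_margin = global_ranking_cost(images, sentences, 0.5)   # 0.0
large_margin = global_ranking_cost(images, sentences, 1.5)   # 2.0
```

With true-pair scores of 1 and false-pair scores of 0, a margin of 0.5 is already satisfied, while a margin of 1.5 is violated by 0.5 in each of the four comparisons.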

2.4 Optimization

SGD with a batch size of 100 and momentum of 0.9. Train for 15 epochs and fine-tune the RCNN after the first 10 epochs.
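For reference, one common form of the SGD-with-momentum update is sketched here on a toy problem. The learning rate of 0.01 and the quadratic objective are assumptions for illustration; the notes above give only the batch size, the momentum and the epoch counts.

```python
def sgd_momentum_step(theta, velocity, grad, lr, momentum=0.9):
    """One update: the velocity accumulates a decaying sum of past gradients."""
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    theta = [t + v for t, v in zip(theta, velocity)]
    return theta, velocity

# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, velocity = [5.0], [0.0]
for _ in range(300):
    grad = [2.0 * theta[0]]
    theta, velocity = sgd_momentum_step(theta, velocity, grad, lr=0.01)
```

In actual training the gradient would be computed on a mini-batch of 100 image-sentence pairs at each step, and the RCNN weights would be added to the updated parameters only after the first 10 epochs.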

3 Conclusion

From a modeling perspective, sentences are modeled only as bags of relations, and the model is incapable of counting. In future work the authors want to extend the model to support counting and reasoning about the spatial positions of objects, and to move beyond bags of fragments.