Unifying Visual-semantic Embedding with Multimodal Neural Language Models

1 Introduction

This work uses the encoder-decoder framework to solve the problem of image-caption generation. The encoder learns a joint sentence-image embedding, where sentences are encoded with an LSTM and images with a CNN. A pairwise ranking loss is minimized in order to learn to rank images and their descriptions. The decoder, a structure-content neural language model, generates sentences conditioned on the distributed representations produced by the encoder.

[Figure: encoder-decoder pipeline overview]

2 Model Description

2.1 LSTM for modeling sentences

 

[Figure: LSTM sentence encoder]
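The equation image is lost; for reference, a standard LSTM formulation (the paper's exact variant may differ slightly, e.g. in whether the cell state also enters the gates):

```latex
% input, forget, and output gates
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
% candidate cell state, cell update, and hidden state
\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
```

The last hidden state of this recurrence is used as the sentence representation in the next section.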

2.2 Multimodal distributed representations

Let K be the dimension of the embedding space, D the dimension of the image feature vector, and V the vocabulary size.

Wi : K×D is the image embedding matrix; Wt : K×V is the word embedding matrix.

For a sentence-image pair, v, the last hidden state of the LSTM, represents the sentence, and

x = Wi·q is the image embedding, where q is the CNN code (image feature vector).

We optimize a pairwise ranking loss.

[Equation: pairwise ranking loss]
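Written out (following the formulation in Kiros et al., 2014), with s(x, v) = x·v as the scoring function on unit-norm embeddings, v_k the contrastive (non-matching) sentences, and x_k the contrastive images:

```latex
\min_{\theta} \sum_{x}\sum_{k} \max\{0,\ \alpha - s(x, v) + s(x, v_k)\}
            + \sum_{v}\sum_{k} \max\{0,\ \alpha - s(v, x) + s(v, x_k)\}
```

Here α is the margin; both directions (ranking sentences for a given image, and images for a given sentence) are penalized.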

 
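A minimal numpy sketch of this pairwise ranking objective (not the authors' code; it assumes unit-normalized embeddings and uses the other pairs in a batch as the contrastive terms):

```python
import numpy as np

def ranking_loss(X, V, margin=0.2):
    """X: (n, K) image embeddings, V: (n, K) sentence embeddings.
    Row i of X matches row i of V; all other rows serve as contrastive terms."""
    # normalize so s(x, v) = x . v is cosine similarity
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    S = X @ V.T                   # S[i, j] = s(x_i, v_j)
    pos = np.diag(S)              # matching-pair scores
    # hinge terms: each image vs contrastive sentences, each sentence vs contrastive images
    cost_im = np.maximum(0.0, margin - pos[:, None] + S)
    cost_s = np.maximum(0.0, margin - pos[None, :] + S)
    np.fill_diagonal(cost_im, 0.0)   # exclude the true pairs
    np.fill_diagonal(cost_s, 0.0)
    return cost_im.sum() + cost_s.sum()
```

When the matching pairs score well above the margin over all contrastive pairs, the loss is zero and no gradient flows; only violating pairs contribute.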

2.3 Log-bilinear neural language models

We take the first n-1 K-dimensional word vectors (w1, w2, ..., wn-1) as context and predict the n-th word; Ci are the context parameter matrices.

(Every word i has an input vector wi and an output vector ri.)

[Equations: predicted next-word representation and softmax over output vectors]
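In standard log-bilinear notation, the context vectors are combined through the context matrices into a predicted next-word representation, which is then scored against each word's output vector:

```latex
\hat{r} = \sum_{i=1}^{n-1} C_i w_i
P(w_n = i \mid w_{1:n-1}) = \frac{\exp(\hat{r}^\top r_i + b_i)}{\sum_{j=1}^{V} \exp(\hat{r}^\top r_j + b_j)}
```

The model is "log-bilinear" because the log-probability is bilinear in the predicted representation and the output vectors.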

2.4 Multiplicative neural language models

Suppose we now have a vector u from the multimodal space, associated with a word sequence; u may be the embedded vector of an image. A multiplicative neural language model models the distribution P(wn = i | w1:n-1, u).

[Equations: multiplicative model with a factored tensor of word representations; the number of factors G is set to K]
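The equation images are lost; as a hedged sketch of the factored formulation (superscript names here are illustrative and may not match the paper exactly), the tensor of word representations is conditioned on u through a three-way factorization, yielding a u-gated word embedding matrix:

```latex
% u-conditioned word representations via a three-way factorization
% W^{fr}, W^{fu}, W^{fx} are the factor matrices; G factors, with G = K
E(u) = (W^{fr})^\top \, \mathrm{diag}(W^{fu} u) \, W^{fx}
```

Prediction then proceeds as in the log-bilinear model, but with word representations that are multiplicatively gated by the modality vector u.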

 

2.5 Structure-content neural language models

Suppose we are given a sequence of word-specific structure variables T = {t1, t2, ..., tn} along with a description S = {w1, w2, ..., wn}; ti may be a part-of-speech tag.

We then model the distribution of wn from the previous word context w1:n-1 and the forward structure context tn:n+k, where k is the context size.

[Equations: structure-content neural language model]
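The equation images are lost; as a hedged sketch only (the paper's exact SC-NLM parameterization may differ, e.g. the structure context could enter multiplicatively like u above), the two contexts can be combined through separate context matrices and scored with the usual softmax:

```latex
% word context and forward structure context, each with its own context matrices
\hat{r} = \sum_{i=1}^{n-1} C^{(w)}_i w_i \;+\; \sum_{j=0}^{k} C^{(t)}_j t_{n+j}
P(w_n = i \mid w_{1:n-1}, t_{n:n+k}) = \frac{\exp(\hat{r}^\top r_i + b_i)}{\sum_{j} \exp(\hat{r}^\top r_j + b_j)}
```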

If the conditioning vector u is the description embedding computed with the LSTM encoder, a large amount of monolingual text can be used to improve the quality of the language model.