Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
1 Introduction
This work uses the encoder-decoder framework to address image-caption generation. The encoder learns a joint sentence-image embedding: sentence embeddings are computed with an LSTM and image embeddings with a CNN, and a pairwise ranking loss is minimized so that images and their descriptions learn to rank one another highly. The decoder is a structure-content neural language model that generates sentences conditioned on the distributed representations produced by the encoder.
2 Model Description
2.1 LSTM for modeling sentences
2.2 Multimodal distributed representations
K, D, V: the dimensionality of the embedding space, the image feature vector, and the vocabulary, respectively.
W_I: a K x D image embedding matrix; W_T: a K x V word embedding matrix.
For a sentence-image pair, v, the last hidden state of the LSTM, represents the sentence, and
x = W_I * q, where q is the CNN code of the image.
We minimize a pairwise ranking loss: the score of a matched image-sentence pair should exceed the scores of mismatched (contrastive) pairs by a margin.
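A minimal sketch of this objective, assuming cosine similarity as the scoring function s(x, v) and a hinge with margin alpha; the function name and margin value are illustrative, not taken from the paper's code:

```python
import numpy as np

def pairwise_ranking_loss(X, V, margin=0.2):
    """X: image embeddings (N, K); V: sentence embeddings (N, K).
    Row i of X and row i of V form a matched pair; every other row
    serves as a contrastive (mismatched) example."""
    # Normalize rows so the dot product is cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    S = X @ V.T                      # S[i, j] = s(x_i, v_j)
    pos = np.diag(S)                 # scores of the matched pairs
    # Hinge terms: contrastive sentences per image, contrastive images per sentence.
    cost_im = np.maximum(0.0, margin - pos[:, None] + S)
    cost_s = np.maximum(0.0, margin - pos[None, :] + S)
    # A pair is not its own contrastive example.
    np.fill_diagonal(cost_im, 0.0)
    np.fill_diagonal(cost_s, 0.0)
    return cost_im.sum() + cost_s.sum()
```

When matched pairs already beat all mismatches by the margin, the loss is zero; otherwise each violating pair contributes its hinge term.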
2.3 Log-bilinear neural language models
Given the first n-1 K-dimensional word vectors (w_1, w_2, ..., w_{n-1}) as context, we predict the n-th word. The C_i are K x K context parameter matrices: the predicted next-word representation is r_hat = sum_i C_i * w_i, and P(w_n = i | w_{1:n-1}) is a softmax over the scores r_hat . r_i + b_i.
(Every word i has an input vector w_i and an output vector r_i.)
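A minimal sketch of one log-bilinear prediction step; the function name and array shapes are illustrative, and the separate input/output word tables mirror the input-vector / output-vector distinction above:

```python
import numpy as np

def lbl_next_word_probs(context_ids, R_in, R_out, C, b):
    """context_ids: ids of the n-1 previous words.
    R_in: (V, K) input word vectors w_i; R_out: (V, K) output word vectors r_i.
    C: (n-1, K, K) per-position context matrices; b: (V,) per-word biases."""
    # Predicted representation of the next word: r_hat = sum_i C_i @ w_i.
    r_hat = sum(C[i] @ R_in[w] for i, w in enumerate(context_ids))
    scores = R_out @ r_hat + b          # r_hat . r_j + b_j for every word j
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()
```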
2.4 Multiplicative neural language models
Suppose we now have a vector u from the multimodal space associated with a word sequence; u may be the embedded vector of an image. A multiplicative neural language model models the distribution P(w_n = i | w_{1:n-1}, u).
Here G, the dimensionality of u, is set equal to K.
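The multiplicative interaction is usually implemented with a factored tensor: the word-representation table becomes a function of u, with u gating a set of factor activations. A minimal sketch, where the number of factors F and all weight names are assumptions made for illustration:

```python
import numpy as np

def context_word_table(u, Wk, Wu, Wv):
    """u: (G,) multimodal conditioning vector; Wk: (F, K); Wu: (F, G); Wv: (F, V).
    Returns E(u) of shape (K, V): column j is word j's vector given u,
    i.e. E(u) = Wk.T @ diag(Wu @ u) @ Wv."""
    gate = Wu @ u                       # (F,) factor activations driven by u
    return Wk.T @ (gate[:, None] * Wv)  # apply diag(gate) to Wv, then mix factors
```

Because each word vector now depends on u, the same word can shift its representation depending on the conditioning image.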
2.5 Structure-content neural language models
Suppose we are given a sequence of word-specific structure variables T = {t_1, t_2, ..., t_n} along with a description S = {w_1, w_2, ..., w_n}; t_i may be a part-of-speech tag.
We then model the distribution P(w_n = i | w_{1:n-1}, t_{n:n+k}), conditioning on the previous-word context w_{1:n-1} and the forward structure context t_{n:n+k}, where k is the context size.
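The two contexts can be combined additively in the predicted next-word representation, with one set of matrices per word position and another per structure position. A sketch under that assumption; the D_j matrices for structure variables are illustrative, mirroring the C_i word-context matrices:

```python
import numpy as np

def scnlm_r_hat(word_vecs, struct_vecs, C, D):
    """word_vecs: list of (K,) vectors for w_1 .. w_{n-1};
    struct_vecs: list of (K,) vectors for t_n .. t_{n+k};
    C: (n-1, K, K) word-context matrices; D: (k+1, K, K) structure matrices.
    Returns the predicted next-word representation r_hat."""
    r_hat = sum(C[i] @ w for i, w in enumerate(word_vecs))
    r_hat += sum(D[j] @ t for j, t in enumerate(struct_vecs))
    return r_hat
```

The resulting r_hat is then scored against output word vectors exactly as in the log-bilinear model.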
If the conditioning vector u is the representation of the description sentence computed with the LSTM (rather than an image embedding), a large amount of monolingual text, with no paired images, can be used to improve the quality of the language model.