DenseCap: Fully Convolutional Localization Networks for Dense Captioning
1 Introduction
This paper addresses object localization and image captioning jointly by proposing a Fully Convolutional Localization Network (FCLN). The architecture is composed of a convolutional network, a novel dense localization layer, and an RNN language model that generates label sequences. The goal is to design an architecture that jointly localizes regions of interest and then describes each with natural language.
2 Model Architecture
2.1 Convolutional Network
We use the VGG-16 architecture. It consists of 13 layers of 3×3 convolutions interspersed with 5 layers of 2×2 max pooling. We remove the final pooling layer, so an input image of shape 3×W×H gives rise to a feature tensor of shape C×W'×H', where C = 512, W' = W/16, H' = H/16.
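To make the downsampling concrete, here is a minimal sketch (assuming the standard VGG-16 layout: the 3×3 convolutions preserve spatial size, and the four remaining 2×2 pooling layers give an overall stride of 2^4 = 16):

```python
# Sketch: with the final pool removed, four 2x2 max-pool layers remain,
# so the spatial dimensions of the input image shrink by a factor of 16.
def feature_shape(W, H, C=512, stride=16):
    """Return the (C, W', H') shape of the conv feature tensor
    for a 3 x W x H input image."""
    return (C, W // stride, H // stride)

print(feature_shape(720, 720))  # (512, 45, 45)
```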
2.2 Fully Convolutional Localization Layer
The layer receives an input tensor of activations, identifies spatial regions of interest, and smoothly extracts a fixed-size representation from each region.
A) Inputs/Outputs:
The localization layer accepts a tensor of activations of size C×W'×H', selects B regions of interest, and returns three output tensors giving information about each region:
1. Region Coordinates: a matrix of shape B×4 giving box coordinates for each region.
2. Region Scores: a vector of length B giving a confidence score for each region.
3. Region Features: a tensor of shape B×C×X×Y giving features for each region, represented by an X×Y grid of C channels.
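As a concrete illustration of these output shapes, here is a sketch with B = 300 regions and an X = Y = 7 feature grid (illustrative values; the actual B, X, and Y are design choices):

```python
import numpy as np

B, C, X, Y = 300, 512, 7, 7  # illustrative: B regions, 7x7 feature grid

coords   = np.zeros((B, 4))        # 1. region coordinates
scores   = np.zeros(B)             # 2. region confidence scores
features = np.zeros((B, C, X, Y))  # 3. per-region feature grids

print(features.shape)  # (300, 512, 7, 7)
```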
B) Convolutional Anchors:
We project each point in the W'×H' grid of input features back into the W×H image plane and consider k anchor boxes of different aspect ratios centered at this projected point.
More specifically, we produce an output tensor of shape 5k×W'×H'; each grid location carries 5k numbers: the 4 coordinates of a region in the W×H image plane and a confidence score, for each of the k anchors.
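A minimal sketch of anchor placement, with hypothetical anchor sizes and aspect ratios (the paper's exact set of k anchors differs):

```python
import numpy as np

def make_anchors(Wp, Hp, stride=16,
                 sizes=(45, 90, 180), ratios=(0.5, 1.0, 2.0)):
    """Hypothetical anchor generator: for each of the Wp x Hp grid cells,
    project its center back into the W x H image plane and place k anchor
    boxes (one per size/ratio pair) centered there.
    Returns an array of shape (Hp, Wp, k, 4) holding (cx, cy, w, h) boxes."""
    k = len(sizes) * len(ratios)
    anchors = np.zeros((Hp, Wp, k, 4))
    for j in range(Hp):
        for i in range(Wp):
            # project the grid cell center into image coordinates
            cx, cy = (i + 0.5) * stride, (j + 0.5) * stride
            idx = 0
            for s in sizes:
                for r in ratios:
                    # same area, different aspect ratio
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors[j, i, idx] = (cx, cy, w, h)
                    idx += 1
    return anchors

a = make_anchors(4, 4)
print(a.shape)  # (4, 4, 9, 4)
```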
C) Box Regression: From each anchor we regress to a region proposal, predicting four scalars that shift the anchor's center and scale its width and height.
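The anchor-to-proposal transformation can be sketched as follows, using the standard Faster R-CNN parameterization that DenseCap adopts: two offsets shift the anchor center by a fraction of its size, and two log-space offsets scale its width and height (a sketch, not the exact implementation):

```python
import math

def apply_deltas(anchor, deltas):
    """Map an anchor (xa, ya, wa, ha) plus predicted offsets
    (tx, ty, tw, th) to a region proposal (x, y, w, h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    x = xa + tx * wa        # shift center by a fraction of anchor size
    y = ya + ty * ha
    w = wa * math.exp(tw)   # scale width/height in log space
    h = ha * math.exp(th)
    return x, y, w, h

print(apply_deltas((100, 100, 50, 50), (0.5, -0.5, 0.0, 0.0)))
# (125.0, 75.0, 50.0, 50.0)
```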
D) Box Sampling: We do not process all region proposals, only a subsample: at training time we sample a minibatch of high- and low-confidence boxes, and at test time we keep the top-scoring proposals.
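A test-time sketch of such subsampling, assuming we simply keep the B highest-scoring proposals (non-maximum suppression and the training-time positive/negative sampling are omitted):

```python
import numpy as np

def sample_top_b(coords, scores, B=300):
    """Keep the B highest-scoring proposals.
    Sketch only: the full pipeline also applies non-maximum
    suppression before truncating to B boxes."""
    order = np.argsort(-scores)[:B]  # indices sorted by descending score
    return coords[order], scores[order]
```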
E) Bilinear Interpolation:
We extract a fixed-size feature representation for each variably sized region proposal.
Given an input feature map U of shape C×W'×H' and a region proposal, we interpolate the features of U to produce an output feature map V of shape C×X×Y. After projecting the region proposal onto U, we compute a sampling grid G of shape X×Y×2 associating each element of V with real-valued coordinates into U.
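The sampling step can be sketched as follows, assuming each grid entry stores real-valued (y, x) coordinates into U and the grid lies inside the feature map:

```python
import numpy as np

def bilinear_sample(U, grid):
    """Sample feature map U (C x H x W) at the real-valued coordinates
    in grid (X x Y x 2, each entry (y, x)), producing V of shape
    C x X x Y via bilinear interpolation."""
    C, H, W = U.shape
    X, Y = grid.shape[:2]
    V = np.zeros((C, X, Y))
    for i in range(X):
        for j in range(Y):
            y, x = grid[i, j]
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
            dy, dx = y - y0, x - x0
            # weighted sum of the four nearest feature columns
            V[:, i, j] = ((1 - dy) * (1 - dx) * U[:, y0, x0] +
                          (1 - dy) * dx       * U[:, y0, x1] +
                          dy * (1 - dx)       * U[:, y1, x0] +
                          dy * dx             * U[:, y1, x1])
    return V
```

Because the interpolation weights are smooth functions of the grid coordinates, gradients can flow back through the sampling into the predicted box coordinates, which is what makes the localization layer end-to-end trainable.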
2.3 Recognition Network
A fully connected neural network that processes the region features from the localization layer, flattening each region's feature grid into a vector and producing a compact code for each region.
2.4 RNN Language Model
We feed the region code to the RNN only at the first time step; at subsequent time steps the input is the embedding of the previous word.
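A toy sketch of this conditioning scheme, using a plain vanilla RNN with random weights and greedy decoding (the actual model is an LSTM trained on caption data; all dimensions and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, VOCAB = 512, 256, 1000  # region-code dim, hidden dim, vocab size (toy)
Wxh = rng.normal(0, 0.01, (H, D))      # input-to-hidden weights
Whh = rng.normal(0, 0.01, (H, H))      # hidden-to-hidden weights
Why = rng.normal(0, 0.01, (VOCAB, H))  # hidden-to-vocab weights
embed = rng.normal(0, 0.01, (VOCAB, D))  # word embedding table (toy)

def generate(region_code, max_len=5):
    """Greedy decoding: the region code is the input at the first step
    only; every later step receives the previous word's embedding."""
    h = np.zeros(H)
    x = region_code              # step 1: condition on the region
    words = []
    for _ in range(max_len):
        h = np.tanh(Wxh @ x + Whh @ h)
        w = int(np.argmax(Why @ h))  # greedy: most likely next word
        words.append(w)
        x = embed[w]             # later steps: previous word's embedding
    return words

caption = generate(rng.normal(size=D))
print(len(caption))  # 5
```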