DenseCap: Fully Convolutional Localisation Networks for Dense Captioning

1 Introduction

This paper addresses object localization and image captioning jointly by proposing a Fully Convolutional Localization Network (FCLN). The architecture is composed of a convolutional network, a novel dense localization layer, and an RNN language model that generates label sequences. The goal is to design an architecture that jointly localizes regions of interest and then describes each with natural language.

 

2 Model Architecture 

2.1 Convolutional Network

We use the VGG-16 architecture. It consists of 13 layers of 3×3 convolutions interspersed with 5 layers of 2×2 max pooling. We remove the final pooling layer, so an input image of shape 3×W×H gives rise to a feature tensor of shape C×W'×H', where C = 512, W' = W/16, and H' = H/16.
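The downsampling arithmetic can be sketched in a few lines. This is just illustrative shape bookkeeping, assuming the stride of 16 that remains after the final pooling layer is removed (four 2×2 poolings, 2⁴ = 16):

```python
# Sketch of the VGG-16 downsampling arithmetic described above.
# With the final pooling layer removed, four 2x2 max-pooling layers
# remain, so the overall spatial stride is 2**4 = 16.

def feature_map_shape(w, h, channels=512, stride=16):
    """Map an input image of shape 3 x W x H to a feature tensor C x W' x H'."""
    return channels, w // stride, h // stride

print(feature_map_shape(720, 720))  # (512, 45, 45)
```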


 

2.2 Fully Convolutional Localization Layer

The layer receives an input tensor of activations, identifies spatial regions of interest, and smoothly extracts a fixed-size representation from each region.

 

 A) Inputs/Outputs:

     The localization layer accepts a tensor of activations of size C×W'×H', selects B regions of interest, and returns three output tensors giving information about each region:

    1. Region Coordinates: a matrix of shape B×4 giving the box coordinates of each region.

    2. Region Scores: a vector of length B giving a confidence score for each region.

    3. Region Features: a tensor of shape B×C×X×Y giving features for each region, represented as an X×Y grid of C-dimensional features.
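To make the three output shapes concrete, here is a toy sketch with hypothetical sizes (B = 300 regions and a 7×7 feature grid are illustrative choices, not values fixed by the layer):

```python
import numpy as np

# Hypothetical sizes, just to make the three output shapes concrete.
B, C, X, Y = 300, 512, 7, 7    # B regions, C channels, X x Y feature grid

coords   = np.zeros((B, 4))        # box coordinates per region
scores   = np.zeros(B)             # one confidence score per region
features = np.zeros((B, C, X, Y))  # X x Y grid of C-dim features per region

print(coords.shape, scores.shape, features.shape)
```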

    

B) Convolutional Anchors:

      We project each point of the W'×H' grid of input features back into the W×H image plane, and consider k anchor boxes of different aspect ratios centered at this projected point.

More specifically, we produce an output tensor of shape 5k×W'×H': each grid location carries 5k numbers, namely 4 coordinates of a region in the W×H plane and a confidence score for each of its k anchors.
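The anchor construction above can be sketched as follows. The stride of 16 matches the network's downsampling; the anchor scale of 64 pixels and the three aspect ratios are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

# Sketch: project each W' x H' grid cell back to image coordinates and
# place k anchor boxes of different aspect ratios at the projected point.
# The scale (64 px) and aspect ratios here are assumptions for illustration.

def make_anchors(w_prime, h_prime, stride=16, scale=64,
                 aspect_ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for j in range(w_prime):
        for i in range(h_prime):
            cx = (j + 0.5) * stride   # center in the W x H image plane
            cy = (i + 0.5) * stride
            for r in aspect_ratios:   # k = len(aspect_ratios) boxes per cell
                w = scale * np.sqrt(r)
                h = scale / np.sqrt(r)
                anchors.append((cx, cy, w, h))
    return np.array(anchors)

A = make_anchors(4, 3)
print(A.shape)  # (4 * 3 * 3, 4) = (36, 4)
```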

C) Box Regression: each anchor is regressed to a region proposal by predicting four scalar offsets to its center and to its width and height in log space, as in Faster R-CNN.
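A minimal sketch of this offset parameterization (the Faster R-CNN style transform; the helper name is my own):

```python
import math

def apply_deltas(anchor, deltas):
    """Turn an anchor (cx, cy, w, h) plus predicted offsets
    (tx, ty, tw, th) into a region proposal: the center shifts
    by a fraction of the anchor size, the size scales in log space."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = deltas
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

print(apply_deltas((100.0, 100.0, 50.0, 50.0), (0.0, 0.0, 0.0, 0.0)))
# zero offsets leave the anchor unchanged: (100.0, 100.0, 50.0, 50.0)
```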

D) Box Sampling: we do not process all region proposals, only a subsample of them (a minibatch of positive and negative boxes at training time, and the top-scoring boxes after non-maximum suppression at test time).

E) Bilinear Interpolation:

   Extracts a fixed-size feature representation from each variably sized region proposal.

   Given an input feature map U of shape C×W'×H' and a region proposal, we interpolate the features of U to produce an output feature map V of shape C×X×Y. After projecting the region proposal onto U, we compute a sampling grid G of shape X×Y×2 associating each element of V with real-valued coordinates into U; each output value is then a bilinearly weighted sum of the four feature-map entries nearest its sampling coordinates.
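A minimal NumPy sketch of this sampling step (a plain loop over the output grid, assuming in-bounds coordinates; real implementations vectorize this):

```python
import numpy as np

def bilinear_sample(U, grid):
    """Sample feature map U (C x H x W) at real-valued coordinates.
    grid has shape X x Y x 2 holding (row, col) positions in U; the
    result V has shape C x X x Y, as in the localization layer."""
    C, H, W = U.shape
    X, Y, _ = grid.shape
    V = np.zeros((C, X, Y))
    for i in range(X):
        for j in range(Y):
            r, c = grid[i, j]
            r0, c0 = int(np.floor(r)), int(np.floor(c))
            r1, c1 = min(r0 + 1, H - 1), min(c0 + 1, W - 1)
            dr, dc = r - r0, c - c0
            # weighted sum of the four nearest feature-map entries
            V[:, i, j] = ((1 - dr) * (1 - dc) * U[:, r0, c0]
                          + (1 - dr) * dc * U[:, r0, c1]
                          + dr * (1 - dc) * U[:, r1, c0]
                          + dr * dc * U[:, r1, c1])
    return V

U = np.arange(4, dtype=float).reshape(1, 2, 2)  # [[0, 1], [2, 3]]
grid = np.array([[[0.5, 0.5]]])                 # sample at the center
print(bilinear_sample(U, grid)[0, 0, 0])        # 1.5
```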


 2.3 Recognition Network

  A fully-connected neural network that processes the region features from the localization layer, flattening each region's C×X×Y feature grid into a vector and producing a compact code for each region.

 

2.4 RNN Language Model

  The region code is fed to the RNN only at the first time step; subsequent steps receive the caption's word embeddings.
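This conditioning scheme can be sketched as follows. A vanilla RNN stands in for the paper's language model, and all sizes and random weights are toy assumptions; the point is only that the region code occupies the first input slot and words fill the rest:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                 # toy code/embedding dimension

region_code = rng.normal(size=D)      # from the recognition network
word_vecs = rng.normal(size=(3, D))   # embedded caption tokens

# The region code is fed only as the input at the first time step;
# every later step sees a word embedding alone.
inputs = np.vstack([region_code, word_vecs])

Wxh = rng.normal(size=(D, D))
Whh = rng.normal(size=(D, D))
h = np.zeros(D)
for x in inputs:                      # vanilla RNN recurrence (a stand-in
    h = np.tanh(x @ Wxh + h @ Whh)    # for the paper's language model)

print(inputs.shape)  # (4, 8): one extra time step for the region code
```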