DenseCap: Fully Convolutional Localisation Networks for Dense Captioning

1 Introduction

This paper addresses object localization and image captioning jointly by proposing a Fully Convolutional Localization Network (FCLN). The architecture is composed of a convolutional network, a novel dense localization layer, and an RNN language model that generates label sequences. The goal is to design an architecture that jointly localizes regions of interest and then describes each with natural language.

 

2 Model Architecture 

2.1 Convolutional Network

We use the VGG-16 architecture. It consists of 13 layers of 3×3 convolutions interspersed with 5 layers of 2×2 max pooling. We remove the final pooling layer, so an input image of shape 3×W×H gives rise to a feature tensor of shape C×W'×H', where C = 512, W' = W/16, and H' = H/16.
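The downsampling arithmetic can be sketched in a few lines. This is just illustrative shape bookkeeping, assuming the stride of 16 that remains after the final pooling layer is removed (four 2×2 poolings, 2⁴ = 16):

```python
# Sketch of the VGG-16 downsampling arithmetic described above.
# With the final pooling layer removed, four 2x2 max-pooling layers
# remain, so the overall spatial stride is 2**4 = 16.

def feature_map_shape(w, h, channels=512, stride=16):
    """Map an input image of shape 3 x W x H to a feature tensor C x W' x H'."""
    return channels, w // stride, h // stride

print(feature_map_shape(720, 720))  # (512, 45, 45)
```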


 

2.2 Fully Convolutional Localization Layer

The layer receives an input tensor of activations, identifies spatial regions of interest, and smoothly extracts a fixed-size representation from each region.

 

 A) Inputs/Outputs:

     The localization layer accepts a tensor of activations of size C×W'×H', selects B regions of interest, and returns three output tensors giving information about each region:

    1. Region Coordinates: a matrix of shape B×4 giving the box coordinates of each region.

    2. Region Scores: a vector of length B giving a confidence score for each region.

    3. Region Features: a tensor of shape B×C×X×Y giving features for each region, represented as an X×Y grid of C-dimensional features.
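To make the three output shapes concrete, here is a toy sketch with hypothetical sizes (B = 300 regions and a 7×7 feature grid are illustrative choices, not values fixed by the layer):

```python
import numpy as np

# Hypothetical sizes, just to make the three output shapes concrete.
B, C, X, Y = 300, 512, 7, 7    # B regions, C channels, X x Y feature grid

coords   = np.zeros((B, 4))        # box coordinates per region
scores   = np.zeros(B)             # one confidence score per region
features = np.zeros((B, C, X, Y))  # X x Y grid of C-dim features per region

print(coords.shape, scores.shape, features.shape)
```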

    

B) Convolutional Anchors:

      We project each point of the W'×H' grid of input features back into the W×H image plane, and consider k anchor boxes of different aspect ratios centered at this projected point.

More specifically, we produce an output tensor of shape 5k×W'×H': each grid location carries 5k numbers, namely 4 coordinates of a region in the W×H plane and a confidence score for each of its k anchors.
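The anchor construction above can be sketched as follows. The stride of 16 matches the network's downsampling; the anchor scale of 64 pixels and the three aspect ratios are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

# Sketch: project each W' x H' grid cell back to image coordinates and
# place k anchor boxes of different aspect ratios at the projected point.
# The scale (64 px) and aspect ratios here are assumptions for illustration.

def make_anchors(w_prime, h_prime, stride=16, scale=64,
                 aspect_ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for j in range(w_prime):
        for i in range(h_prime):
            cx = (j + 0.5) * stride   # center in the W x H image plane
            cy = (i + 0.5) * stride
            for r in aspect_ratios:   # k = len(aspect_ratios) boxes per cell
                w = scale * np.sqrt(r)
                h = scale / np.sqrt(r)
                anchors.append((cx, cy, w, h))
    return np.array(anchors)

A = make_anchors(4, 3)
print(A.shape)  # (4 * 3 * 3, 4) = (36, 4)
```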

C) Box Regression: each anchor is regressed to a region proposal by predicting four scalar offsets to its center and to its width and height in log space, as in Faster R-CNN.
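A minimal sketch of this offset parameterization (the Faster R-CNN style transform; the helper name is my own):

```python
import math

def apply_deltas(anchor, deltas):
    """Turn an anchor (cx, cy, w, h) plus predicted offsets
    (tx, ty, tw, th) into a region proposal: the center shifts
    by a fraction of the anchor size, the size scales in log space."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = deltas
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

print(apply_deltas((100.0, 100.0, 50.0, 50.0), (0.0, 0.0, 0.0, 0.0)))
# zero offsets leave the anchor unchanged: (100.0, 100.0, 50.0, 50.0)
```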

D) Box Sampling: we do not process all region proposals, only a subsample of them (a minibatch of positive and negative boxes at training time, and the top-scoring boxes after non-maximum suppression at test time).

E) Bilinear Interpolation:

   Extracts a fixed-size feature representation from each variably sized region proposal.

   Given an input feature map U of shape C×W'×H' and a region proposal, we interpolate the features of U to produce an output feature map V of shape C×X×Y. After projecting the region proposal onto U, we compute a sampling grid G of shape X×Y×2 associating each element of V with real-valued coordinates into U; each output value is then a bilinearly weighted sum of the four feature-map entries nearest its sampling coordinates.
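A minimal NumPy sketch of this sampling step (a plain loop over the output grid, assuming in-bounds coordinates; real implementations vectorize this):

```python
import numpy as np

def bilinear_sample(U, grid):
    """Sample feature map U (C x H x W) at real-valued coordinates.
    grid has shape X x Y x 2 holding (row, col) positions in U; the
    result V has shape C x X x Y, as in the localization layer."""
    C, H, W = U.shape
    X, Y, _ = grid.shape
    V = np.zeros((C, X, Y))
    for i in range(X):
        for j in range(Y):
            r, c = grid[i, j]
            r0, c0 = int(np.floor(r)), int(np.floor(c))
            r1, c1 = min(r0 + 1, H - 1), min(c0 + 1, W - 1)
            dr, dc = r - r0, c - c0
            # weighted sum of the four nearest feature-map entries
            V[:, i, j] = ((1 - dr) * (1 - dc) * U[:, r0, c0]
                          + (1 - dr) * dc * U[:, r0, c1]
                          + dr * (1 - dc) * U[:, r1, c0]
                          + dr * dc * U[:, r1, c1])
    return V

U = np.arange(4, dtype=float).reshape(1, 2, 2)  # [[0, 1], [2, 3]]
grid = np.array([[[0.5, 0.5]]])                 # sample at the center
print(bilinear_sample(U, grid)[0, 0, 0])        # 1.5
```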


 2.3 Recognition Network

  A fully-connected neural network that processes the region features from the localization layer, flattening each region's C×X×Y feature grid into a vector and producing a compact code for each region.

 

2.4 RNN Language Model

  The region code is fed to the RNN only at the first time step; subsequent steps receive the caption's word embeddings.
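This conditioning scheme can be sketched as follows. A vanilla RNN stands in for the paper's language model, and all sizes and random weights are toy assumptions; the point is only that the region code occupies the first input slot and words fill the rest:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                 # toy code/embedding dimension

region_code = rng.normal(size=D)      # from the recognition network
word_vecs = rng.normal(size=(3, D))   # embedded caption tokens

# The region code is fed only as the input at the first time step;
# every later step sees a word embedding alone.
inputs = np.vstack([region_code, word_vecs])

Wxh = rng.normal(size=(D, D))
Whh = rng.normal(size=(D, D))
h = np.zeros(D)
for x in inputs:                      # vanilla RNN recurrence (a stand-in
    h = np.tanh(x @ Wxh + h @ Whh)    # for the paper's language model)

print(inputs.shape)  # (4, 8): one extra time step for the region code
```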