Spatial Transformer Networks

1 Introduction

A desirable property of a system that is able to reason about images is the ability to disentangle object pose and part deformation from texture and shape.

To overcome the drawback that CNNs lack the ability to be spatially invariant to the input data in a computationally and parameter-efficient manner, the authors propose a new learnable module, the spatial transformer, which is differentiable and explicitly allows spatial manipulation of the data within the network. It can be plugged into a standard neural network to provide spatial transformation abilities.

At present, few CNNs can maintain spatial invariance, yet it is a very useful property for recognizing objects. Cropping part of a dog image, translating it, or warping it should not change the fact that it is a dog. To make a CNN recognize this correctly, the spatial transformer is introduced. How the image is transformed depends on the input, and the transformation parameters are learned automatically so as to minimize the overall cost.


2 Spatial Transformers

The transformation is conditioned on the input feature map and produces a single output feature map. There are three parts: a localization network takes the input feature map and, through a number of hidden layers, outputs the parameters of the spatial transformation that should be applied to the feature map. The predicted parameters are then used to create a sampling grid, which is a set of points where the input map should be sampled to produce the transformed output; this is done by the grid generator. Finally, the feature map and the sampling grid are taken as inputs to the sampler, which produces the output map.
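To make the three parts concrete, here is a minimal sketch of a spatial transformer module in PyTorch. This is not the authors' code: the 32×32 single-channel input and the layer sizes are assumptions, and the grid generator and sampler are realized with F.affine_grid and F.grid_sample.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal spatial transformer: localisation net -> grid generator -> sampler.
    Layer sizes and the 32x32 single-channel input are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        # Localisation network: a small CNN whose last layer regresses
        # the 6 parameters of a 2x3 affine matrix.
        self.loc_net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc_theta = nn.Sequential(
            nn.Linear(10 * 4 * 4, 32), nn.ReLU(),
            nn.Linear(32, 6),  # final regression layer -> theta
        )
        # Start from the identity transformation so early training is stable.
        self.fc_theta[2].weight.data.zero_()
        self.fc_theta[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, u):                            # u: (N, 1, 32, 32)
        theta = self.fc_theta(self.loc_net(u).flatten(1)).view(-1, 2, 3)
        # Grid generator: target-grid points mapped back into the source image.
        grid = F.affine_grid(theta, u.size(), align_corners=False)
        # Sampler: bilinear sampling of U at the grid points produces V.
        return F.grid_sample(u, grid, align_corners=False)
```

Because every step is differentiable, the module can simply be dropped in front of (or inside) a classifier and trained with backpropagation like any other layer.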

 

2.1 Localization Network

$$\theta = f_{\mathrm{loc}}(U), \qquad U \in \mathbb{R}^{H \times W \times C}$$

f_loc() can take any form, but should include a final regression layer to produce the transformation parameters θ.
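For instance, f_loc() can be as simple as a fully connected network. The sketch below (sizes are assumed, not from the paper) flattens the feature map and ends in a linear regression layer that is initialised to output the identity transformation.

```python
import torch
import torch.nn as nn

def make_fc_localisation_net(in_features: int) -> nn.Sequential:
    """Fully connected localisation network ending in a regression layer for theta."""
    net = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features, 32),
        nn.ReLU(),
        nn.Linear(32, 6),               # final regression layer -> 6 affine parameters
    )
    # Initialise so the predicted transformation starts as the identity.
    net[3].weight.data.zero_()
    net[3].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
    return net

# Hypothetical 32x32 single-channel feature maps, batch of 4.
theta = make_fc_localisation_net(1 * 32 * 32)(torch.randn(4, 1, 32, 32)).view(-1, 2, 3)
```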

 

2.2 Parameterised Sampling Grid

To perform a warping of the input feature map, each output pixel is computed by applying a sampling kernel centered at a particular location in the input feature map.

 

There are many kinds of transformations. For an affine transformation, the formula is as below:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

(x_i^s, y_i^s) are the source coordinates in the input feature map, and (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map.

The transformation defined above allows cropping, translation, rotation, scale and skew to be applied to the input feature map, and requires only 6 parameters.
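As a concrete illustration of the grid generator (an illustrative NumPy sketch, not the paper's code), the regular target grid is built in normalised coordinates and mapped through A_θ to obtain the source sampling locations:

```python
import numpy as np

def affine_grid(theta: np.ndarray, H: int, W: int) -> np.ndarray:
    """theta: (2, 3) affine matrix. Returns source coordinates (H, W, 2) as (x_s, y_s)."""
    # Regular target grid with coordinates normalised to [-1, 1].
    xt, yt = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
    target = np.stack([xt, yt, np.ones_like(xt)], axis=-1)   # (H, W, 3) homogeneous
    # (x_s, y_s) = A_theta (x_t, y_t, 1)^T for every grid point
    return target @ theta.T                                  # (H, W, 2)

identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
zoom = np.array([[0.5, 0.0, 0.0],         # scale < 1: attend to the central crop
                 [0.0, 0.5, 0.0]])
print(affine_grid(identity, 4, 4)[0, 0])  # [-1. -1.]
print(affine_grid(zoom, 4, 4)[0, 0])      # [-0.5 -0.5]
```

With a scale smaller than 1 the sampling grid covers only part of the input, which is how an affine spatial transformer implements a differentiable crop / attention window.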


2.3 Differentiable Image Sampling

The sampler takes the set of sampling points T_θ(G), along with the input feature map U, and produces the sampled output feature map V.

Using a generic sampling kernel k(·) with parameters Φ, each output value is

$$V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y)$$

With a bilinear sampling kernel this becomes

$$V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)$$
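The bilinear kernel can be written directly as code. The following NumPy sketch (illustrative only: single channel, zero padding outside the image) computes the same sum, but since only the four pixels neighbouring each sampling point have non-zero weight, it visits just those:

```python
import numpy as np

def bilinear_sample(U: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """U: (H, W) input map. grid: (H_out, W_out, 2) source coords (x_s, y_s) in [-1, 1]."""
    H, W = U.shape
    # Un-normalise from [-1, 1] to pixel coordinates.
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1

    def pixel(yy, xx):
        # Zero padding: sampling points that fall outside U contribute nothing.
        inside = (xx >= 0) & (xx < W) & (yy >= 0) & (yy < H)
        return np.where(inside, U[np.clip(yy, 0, H - 1), np.clip(xx, 0, W - 1)], 0.0)

    wx1, wy1 = x - x0, y - y0                    # bilinear weights max(0, 1 - |.|)
    wx0, wy0 = 1 - wx1, 1 - wy1
    return (pixel(y0, x0) * wy0 * wx0 + pixel(y0, x1) * wy0 * wx1 +
            pixel(y1, x0) * wy1 * wx0 + pixel(y1, x1) * wy1 * wx1)
```

Both the sampled values and the weights are (sub-)differentiable with respect to U and the sampling coordinates, so gradients flow back through the sampler into the grid generator and the localisation network.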

 

2.4 Some take-aways

Placing spatial transformers within a CNN allows the network to learn how to actively transform the feature maps so as to minimize the overall cost function of the network during training.

It is also possible to use a spatial transformer to downsample or oversample a feature map, since the output grid size can be chosen to be different from the input size.
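For example, with PyTorch's affine_grid/grid_sample (an assumed setup, not from the paper), downsampling just means asking the grid generator for a smaller output size:

```python
import torch
import torch.nn.functional as F

u = torch.randn(1, 3, 64, 64)                      # input feature map
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])          # identity transform
# Request a 32x32 output grid: the sampler reads the full input
# but writes a feature map at half the spatial resolution.
grid = F.affine_grid(theta, size=(1, 3, 32, 32), align_corners=False)
v = F.grid_sample(u, grid, align_corners=False)
print(v.shape)                                     # torch.Size([1, 3, 32, 32])
```

Note that with a small fixed-support kernel such as bilinear, downsampling this way can introduce aliasing, since each output value only sees a few input pixels.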

Finally, it is possible to have multiple spatial transformers in a CNN. Placing multiple spatial transformers at increasing depths of a network allows transformations of increasingly abstract representations. One can also use multiple spatial transformers in parallel -- this can be useful if there are multiple objects or parts of interest in a feature map that should be focused on individually.