Problem discussed: Reducing computation in neural networks by using two subnetworks of different capacities

Method: The authors introduce the Dynamic Capacity Network (DCN), which consists of two subnetworks: a high-capacity network and a low-capacity network. The low-capacity network drives a gradient-based hard-attention process that finds the key locations for classification; the patches around these locations are then fed through the high-capacity network to obtain the final classification output. One of the key contributions of this paper is that the hard-attention mechanism does not require a policy network trained by reinforcement learning, so DCNs can be trained end to end with backpropagation. The bottom layers are composed of the two subnetworks, a coarse (low-capacity) network $f_c$ and a fine (high-capacity) network $f_f$. The top layers $g$ take the features from the bottom layers and output a probability distribution over the classes. A minimal sketch of these three components is given below.
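To make the layout concrete, here is a rough PyTorch sketch of the three components. The module names and layer sizes are my own illustrative choices, not the paper's architecture; the only structural constraint I am assuming is that $f_f$ produces features with the same dimensionality as $f_c$, so the two can be mixed before being passed to $g$.

```python
import torch
import torch.nn as nn

class CoarseNet(nn.Module):
    """f_c: low-capacity bottom layers, applied to the whole image."""
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
    def forward(self, x):          # -> coarse feature map of shape (B, dim, s1, s2)
        return self.conv(x)

class FineNet(nn.Module):
    """f_f: high-capacity bottom layers, applied only to selected patches.
    Its output feature dimension matches f_c so fine features can replace coarse ones."""
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU())
    def forward(self, patch):
        return self.conv(patch)

class TopLayers(nn.Module):
    """g: maps a (possibly refined) feature map to class probabilities."""
    def __init__(self, dim=32, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)
    def forward(self, feats):      # feats: (B, dim, s1, s2)
        return torch.softmax(self.fc(feats.mean(dim=(2, 3))), dim=1)
```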

For an input image $x$, they first compute the coarse representation $f_c(x)$ using the low-capacity network:

$$f_c(x) = \{\, c_{i,j} \mid (i,j) \in [s_1] \times [s_2] \,\}$$

where $c_{i,j} = f_c(x_{i,j})$ and $s_1$, $s_2$ are the spatial dimensions of the feature map. The coarse output of the model is then computed from these features as $o_c = h_c(x) = g(f_c(x))$. The saliency measure used is the entropy of this coarse output:

$$H = -\sum_{l=1}^{C} o_c^{(l)} \log o_c^{(l)}$$

where $C$ is the number of classes and $o_c^{(l)}$ is the predicted probability of class $l$. The saliency map $M$ is computed by taking the norm of the gradient of $H$ with respect to $c_{i,j}$, i.e. $M_{i,j} = \lVert \partial H / \partial c_{i,j} \rVert$. From the saliency map, one can select the most salient patches as input to the fine layers.
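A sketch of how this saliency step could look, assuming PyTorch autograd; the layer sizes, the small epsilon inside the log, and the choice of $k$ positions are my own illustrative values, not the paper's:

```python
import torch
import torch.nn as nn

f_c = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())   # coarse layers
g = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                  nn.Linear(32, 10), nn.Softmax(dim=1))                     # top layers

x = torch.randn(1, 1, 60, 60)              # one input image
c = f_c(x)                                  # coarse features c_{i,j}, shape (1, 32, 30, 30)
c.retain_grad()                             # keep the gradient of this non-leaf tensor

o_c = g(c)                                  # coarse class distribution
H = -(o_c * torch.log(o_c + 1e-8)).sum()    # entropy of the coarse prediction
H.backward()

M = c.grad.norm(dim=1).squeeze(0)           # saliency map ||dH/dc_{i,j}||, shape (30, 30)

k = 4                                       # number of salient positions to refine
idx = M.flatten().topk(k).indices
rows, cols = idx // M.shape[1], idx % M.shape[1]
print(list(zip(rows.tolist(), cols.tolist())))   # positions whose patches go to f_f
```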

The parameters of the network are learned by minimizing the cross-entropy loss

$$\mathcal{L}(x, y) = -\log p(y \mid x)$$

where

$$p(y \mid x) = g\big(f_{\mathrm{refined}}(x)\big)_y,$$

and $f_{\mathrm{refined}}(x)$ is the refined representation obtained by replacing the coarse features $c_{i,j}$ at the selected salient patches with the corresponding fine features from $f_f$, while keeping the coarse features everywhere else.

They add an additional hint term to the loss function, which penalizes the distance between the coarse and fine representations in order to encourage the coarse features to mimic the fine ones. This term is computed only on the selected salient patches, and it is used only to learn the parameters of the coarse layers.
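Under these assumptions, the combined objective might look like the following sketch. The helper name `dcn_loss`, the weight `lam`, and the use of a squared L2 distance are mine; the paper's exact formulation may differ. Detaching the fine features ensures the hint gradient only reaches the coarse layers:

```python
import torch
import torch.nn.functional as F

def dcn_loss(refined_logits, target, coarse_feats_sel, fine_feats_sel, lam=1.0):
    """refined_logits: g applied to the refined representation (pre-softmax);
    coarse_feats_sel / fine_feats_sel: features at the selected salient patches."""
    ce = F.cross_entropy(refined_logits, target)              # -log p(y | x)
    # Hint term: pull the coarse features toward the (detached) fine features,
    # so its gradient updates only the coarse layers.
    hint = F.mse_loss(coarse_feats_sel, fine_feats_sel.detach())
    return ce + lam * hint
```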

The experiments were performed on the Cluttered MNIST and SVHN datasets, where the DCN shows significant improvements over previous methods. The DCN model also reduces computation, and therefore inference time, compared to running a single high-capacity model on the full input.

I am going to try class activation maps, discussed in one of my previous posts, to generate saliency maps that do not require a backpropagation step.
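For reference, a class activation map only needs the last convolutional feature map and the weights of the final linear layer (assuming the usual global-average-pooling architecture), so no backward pass is involved. A rough sketch, with my own function name and tensor shapes:

```python
import torch

def class_activation_map(feature_map, fc_weight, class_idx):
    """feature_map: (d, s1, s2) features from the last conv layer;
    fc_weight: (C, d) weights of the final linear layer.
    Returns an (s1, s2) map for class_idx using only a forward pass."""
    return torch.einsum('d,dij->ij', fc_weight[class_idx], feature_map)
```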

Reference:

Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, Aaron Courville. Dynamic Capacity Networks. ICML 2016.
