**Problem discussed:** Reducing computation in neural networks by using two subnetworks.

**Method:** The authors introduce the Dynamic Capacity Network (DCN), which consists of two subnetworks: a high-capacity network and a low-capacity network. The low-capacity network uses a gradient-based hard-attention process to find the locations most relevant for classification; the patches around these locations are then fed through the high-capacity network to produce the final classification output. A key contribution of this paper is that the hard-attention mechanism does not require a policy network trained by reinforcement learning, so DCNs can be trained end-to-end with backpropagation. Concretely, the bottom layers come in two variants: a coarse (low-capacity) network f_{c} and a fine (high-capacity) network f_{f}. The top layers g take the output of the bottom layers and produce a probability distribution over the classes.
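A minimal numpy sketch of this two-subnetwork layout, with toy linear maps standing in for real convolutional layers (every dimension and weight name here is an illustrative assumption, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions): 28x28 input split into a 4x4 grid of
# 7x7 patches, 8-dimensional features, 10 classes.
S1, S2, D, C = 4, 4, 8, 10
patch_dim = 7 * 7

W_c  = rng.standard_normal((D, patch_dim)) * 0.1   # coarse weights
W_f1 = rng.standard_normal((32, patch_dim)) * 0.1  # fine weights (layer 1)
W_f2 = rng.standard_normal((D, 32)) * 0.1          # fine weights (layer 2)
W_g  = rng.standard_normal((C, D)) * 0.1           # top-layer weights

def f_c(patch):
    """Low-capacity 'coarse' feature extractor: a single linear map."""
    return W_c @ patch.ravel()

def f_f(patch):
    """High-capacity 'fine' extractor: two linear layers with a ReLU."""
    h = np.maximum(0.0, W_f1 @ patch.ravel())
    return W_f2 @ h

def g(feature_map):
    """Top layers: average-pool the grid, then a linear softmax classifier."""
    pooled = feature_map.reshape(-1, D).mean(axis=0)
    logits = W_g @ pooled
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = rng.standard_normal((28, 28))
patches = x.reshape(S1, 7, S2, 7).transpose(0, 2, 1, 3)  # (S1, S2, 7, 7)

# Coarse representation over the whole grid, and the coarse output o_c
coarse = np.stack([[f_c(patches[i, j]) for j in range(S2)] for i in range(S1)])
o_c = g(coarse)

# Substituting fine features at one (hypothetically salient) position
refined = coarse.copy()
refined[0, 1] = f_f(patches[0, 1])
o_r = g(refined)
```

Note that f_c and f_f produce features of the same dimension, so fine features can replace coarse ones position by position.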

For an input image **x**, they first compute the coarse representation using the low-capacity network:

f_{c}(x) = \{ c_{i,j} \mid 1 \le i \le s_{1},\ 1 \le j \le s_{2} \}

where c_{i,j} = f_{c}(x_{i,j}), and s_{1} and s_{2} are the spatial dimensions of the feature map. The output of the model is then computed on the coarse features: o_{c} = h_{c}(x) = g(f_{c}(x)). The saliency measure used is the entropy of this coarse output:

H = -\sum_{l=1}^{C} o_{c}^{(l)} \log o_{c}^{(l)}

where C is the number of classes. The saliency map M is computed by taking the norm of the gradient of H with respect to each c_{i,j}, i.e. M_{i,j} = \| \partial H / \partial c_{i,j} \|. From the saliency map, one can select the most salient patches as input to the fine layers.
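The saliency computation can be sketched as follows; for simplicity the gradient of H with respect to c_{i,j} is approximated here with finite differences rather than backpropagation, and g is a toy linear-softmax classifier (all dimensions and weights are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
S1, S2, D, C = 4, 4, 8, 10
W_g = rng.standard_normal((C, D)) * 0.1

def g(feature_map):
    """Toy top layers: average-pool, linear map, softmax."""
    pooled = feature_map.reshape(-1, D).mean(axis=0)
    logits = W_g @ pooled
    e = np.exp(logits - logits.max())
    return e / e.sum()

def entropy(p):
    """H = -sum_l p_l log p_l over the class distribution."""
    return -np.sum(p * np.log(p + 1e-12))

c = rng.standard_normal((S1, S2, D))  # coarse features c_{i,j}
H = entropy(g(c))

# Saliency map M_{i,j} = ||dH/dc_{i,j}||, approximated by perturbing
# each feature dimension and measuring the change in entropy.
eps = 1e-5
M = np.zeros((S1, S2))
for i in range(S1):
    for j in range(S2):
        grad = np.zeros(D)
        for d in range(D):
            cp = c.copy()
            cp[i, j, d] += eps
            grad[d] = (entropy(g(cp)) - H) / eps
        M[i, j] = np.linalg.norm(grad)

# Select the k most salient positions; their patches would be fed
# through the fine network.
k = 3
top = np.argsort(M, axis=None)[::-1][:k]
salient_positions = [np.unravel_index(t, (S1, S2)) for t in top]
```

In the paper the gradient is obtained with one backpropagation pass through the coarse model, which is much cheaper than this finite-difference loop.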

The parameters of the network are learned by minimizing the cross-entropy loss

L = -\log p(y \mid x)

where p(y \mid x) is the probability that the model's softmax output assigns to the true class y.

They add an additional term to the loss function that measures the distance between the coarse and fine representations, encouraging the two to be similar. This term is used only to learn the parameters of the coarse layers, and its inputs are the selected salient patches.
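A sketch of the resulting objective: cross-entropy plus a hint term given by the squared distance between fine and coarse features of the selected patches. The trade-off weight lam and all values below are illustrative assumptions; in the paper the hint term updates only the coarse-layer parameters, while here we only compute the loss value:

```python
import numpy as np

rng = np.random.default_rng(1)

def cross_entropy(p, y):
    """-log p(y|x) for the true class y."""
    return -np.log(p[y] + 1e-12)

# Assumed toy values: predicted class distribution and true label
p = np.array([0.1, 0.7, 0.2])
y = 1

# Fine and coarse features of the k selected salient patches (dim D each)
k, D = 3, 8
fine   = rng.standard_normal((k, D))
coarse = rng.standard_normal((k, D))

# Hint term: mean squared distance between fine and coarse representations
# of the salient patches.
hint = np.mean(np.sum((fine - coarse) ** 2, axis=1))

lam = 0.1  # trade-off weight (an assumption, to be tuned)
loss = cross_entropy(p, y) + lam * hint
```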

The experiments were performed on the Cluttered MNIST and SVHN datasets. The DCN shows significant improvements over previous methods, and it reduces computation, and therefore inference time, compared to a single high-capacity model.

I am going to try class activation maps, discussed in one of my previous posts, to generate saliency maps without requiring a backpropagation step.
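For reference, a class activation map is a forward-only computation: the final convolutional feature maps are weighted by the classifier weights of the chosen class. A minimal sketch with random stand-in values (the feature maps and weights below are assumptions, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
H_, W_, D, C = 4, 4, 8, 10

# Stand-in for the last convolutional layer's output (after ReLU)
feature_maps = np.maximum(0.0, rng.standard_normal((H_, W_, D)))

# Weights of the final linear layer that follows global average pooling
W_fc = rng.standard_normal((C, D)) * 0.1

# Class scores: global average pooling followed by the linear layer
pooled = feature_maps.reshape(-1, D).mean(axis=0)
scores = W_fc @ pooled
cls = int(np.argmax(scores))

# Class activation map: weight each feature channel by the chosen
# class's classifier weight and sum over channels.
cam = np.tensordot(feature_maps, W_fc[cls], axes=([2], [0]))  # (H_, W_)
```

Unlike the DCN saliency map, this requires no gradient, but it assumes a global-average-pooling architecture before the classifier.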

**Reference:**

Amjad Almahairi, Nicolas Ballas, Tim Cooijmans, Yin Zheng, Hugo Larochelle, Aaron Courville. *Dynamic Capacity Networks*. ICML 2016.
