Problem discussed: Transferring the style of artistic images to real-life photographs.
Image Style Transfer Using Convolutional Neural Networks: This work introduced the idea of transferring style using neural networks. The idea behind the work is to extract the semantic information from the content image (a real-life photograph) and then transfer the textures from the style image for rendering. To do this, one needs representations that model the semantic content and the style independently. Deep convolutional neural networks learn different representations of the image at different levels and are thus suitable for this task. Semantic information is generally preserved in the deeper layers, while low-level pixel information is represented in the initial layers. The authors verify this by reconstructing images from different layers and observing that pixel information is lost in the deeper layers while semantic information is preserved. The figure below shows the reconstruction of the content image from different layers of the network.
The style reconstructions shown in the figure are constructed by combining features from subsets of layers. From left to right, each style reconstruction combines the features of the layer below it with those of all the layers to its left to generate the final representation. This discards the object-level information and the arrangement of the scene but preserves the style.
Now coming to how it is done in practice. The authors start with an image initialized with white noise and optimize a loss to generate the style-transferred output. The optimization is designed to preserve the semantic information from the content image and the style information from the style image. Thus the loss used for backpropagation is the sum of two terms: a content loss and a style loss. The figure below shows how both of these losses are computed.
F^l is the response of layer l, an N_l × M_l feature map, where M_l is the height times the width of the feature map and N_l is the number of filters in that layer. The content loss is the squared-error loss between the feature representations of the content image and the output image. For the style loss, they design a feature space that captures texture information, constructed from the correlations between different filter responses. These correlations are computed as a Gram matrix G^l for each layer. The element G^l_ij of the Gram matrix G^l ∈ R^(N_l × N_l) is the dot product of the i-th and j-th feature maps of layer l.
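In code, the content loss and the Gram matrix for a single layer can be sketched as follows; the feature maps here are random stand-ins for real CNN responses, not the output of an actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in responses of layer l: N_l filters, each flattened to M_l = H*W values.
N_l, H, W = 4, 8, 8
F_content = rng.standard_normal((N_l, H * W))   # F^l for the content image
F_output = rng.standard_normal((N_l, H * W))    # F^l for the generated image

# Content loss: squared-error between the two feature representations.
content_loss = 0.5 * np.sum((F_output - F_content) ** 2)

def gram_matrix(F):
    """G^l_ij = dot product of the i-th and j-th feature maps, i.e. G^l = F F^T."""
    return F @ F.T

G = gram_matrix(F_output)   # N_l x N_l matrix of filter correlations
```

The style loss for a layer is then the squared difference between the Gram matrices of the style image and the generated image, normalized by N_l and M_l.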
The optimization jointly tries to reduce the content and style losses. By adjusting the weights on the style and content terms, one can decide how much emphasis to put on content versus style.
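A toy version of this joint optimization, assuming an identity "network" (the image serves as its own feature map, so the gradients can be written by hand) and dropping the paper's normalization constants:

```python
import numpy as np

rng = np.random.default_rng(0)

def gram(F):
    # G = F F^T: correlations between filter responses.
    return F @ F.T

F_content = rng.standard_normal((4, 16))   # target content features
F_style = rng.standard_normal((4, 16))     # target style features
G_style = gram(F_style)

alpha, beta = 1.0, 1e-3   # weights controlling content vs. style emphasis
lr = 1e-2

def total_loss(F):
    content = 0.5 * np.sum((F - F_content) ** 2)
    style = 0.25 * np.sum((gram(F) - G_style) ** 2)
    return alpha * content + beta * style

F = rng.standard_normal((4, 16))  # white-noise initialization
loss0 = total_loss(F)

for _ in range(500):
    grad_content = F - F_content               # gradient of the content term
    grad_style = (gram(F) - G_style) @ F       # gradient of the style term
    F -= lr * (alpha * grad_content + beta * grad_style)
```

Raising beta relative to alpha pushes the result toward the style statistics; raising alpha keeps it closer to the content. The original work uses L-BFGS on the pixels of the image with gradients obtained by backpropagation through the CNN rather than this hand-derived gradient descent.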
The results generated using this method are aesthetically appealing and have attracted the attention of many researchers. However, to generate these results one has to run an optimization that can take many iterations. This problem is mitigated to some extent in the next posts I am going to discuss.
The authors also comment that when transferring style from one real-life image to another, the synthesized image contains some low-level noise, which is not visible when the style image is a painting.