An Improved Image Transformation Network for Neural Style Transfer

Abstract. By using Convolutional Neural Networks (CNNs), the semantics and styles of images can be separated and recombined to create fascinating images. In this paper, an image transformation network for style transfer is proposed, which consists of convolution layers, deconvolution layers, and Fusion modules composed of two 1x1 convolution layers and a residual block. The output of each layer in the network is normalized using batch normalization to speed up the training process. Compared with other networks, our network has fewer parameters and better real-time performance while generating images of similar quality.


Introduction
For centuries, painting has been a popular form of art, producing plenty of valuable masterpieces that attract people's attention. In the past, however, it would take a well-trained artist a long time to produce a painting in a particular style.
Recently, Gatys et al. first studied how to use CNNs to reproduce famous painting styles on natural images. They obtained image representations from a CNN and found that the content and the style of an image are separable. Based on these findings, Gatys et al. proposed the Neural Style Transfer algorithm [1] to recombine the content of a given image with the style of famous artworks. However, the efficiency of their algorithm cannot meet real-time requirements. Johnson et al. introduced a fast method based on the algorithm of Gatys et al. They first trained an equivalent feed-forward generator network for each particular style by using the perceptual loss function [2] they proposed. The perceptual loss function computes the loss from high-level features extracted with the 16-layer VGG network [3] pretrained on the ImageNet dataset [4]. When a content image is to be stylized, only a single forward pass is required to produce the result.
Due to the amazing stylized results, the study of Neural Style Transfer has led to many successful industrial applications. The mobile application Prisma [5] is one of the first industrial applications to offer the Neural Style Transfer algorithm as a service. Before Prisma appeared, people never imagined that one day they could turn their images into artworks in just a few minutes. To meet the growing needs of mobile devices, a smaller and faster network is urgently needed.
In this paper, a new module that can be used to construct an image transformation network is proposed. To train the network, the pretrained 16-layer VGG network is used to extract high-level features of the images, and our network is trained by minimizing the perceptual loss function. At test time, compared with the network of Johnson et al., our transformation network reduces the number of parameters by 62.3% and the running time by 12%.

Feed-forward Image Transformation
In recent years, deep convolutional neural networks have been trained to solve many image transformation problems. Since the purpose of our image transformation network is to convert an image into a stylized image, we refer to the architectures of Fully Convolutional Networks [6] and the Deconvolution Network [7]. In our image transformation network, instead of pooling layers, convolution layers and deconvolution layers are used to perform the downsampling and upsampling operations.

Neural Style Transfer
The method proposed by Gatys et al. starts from random noise and iteratively refines the stylized image by back-propagation. The method of Johnson et al. instead trains a feed-forward network on a large image dataset for each particular style. Gradient descent is used to optimize the network by iteratively updating its weights. The two methods use similar objective functions.
The perceptual loss function was improved by Johnson et al. on the basis of the work of Gatys et al. Given the style and content images $y_s$ and $y_c$, and the layers $j$ and $J$ of the loss network $\phi$ used for feature and style reconstruction, the stylized image $\hat{y}$ is generated by minimizing the total loss

$$\mathcal{L} = \lambda_c\,\ell_{feat}^{\phi,j}(\hat{y}, y_c) + \lambda_s\,\ell_{style}^{\phi,J}(\hat{y}, y_s) + \lambda_{TV}\,\ell_{TV}(\hat{y}) \tag{1}$$

where $\lambda_c$, $\lambda_s$ and $\lambda_{TV}$ are scalars weighting the feature reconstruction loss, the style reconstruction loss and the total variation regularizer in the total loss. In this paper, the loss network $\phi$ is the pretrained 16-layer VGG network.
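A minimal sketch of how this loss might be computed with TensorFlow is given below. The VGG layer names, the loss weights, and the omission of VGG input preprocessing are illustrative assumptions rather than the paper's exact settings.

```python
import tensorflow as tf

def gram_matrix(feat):
    """Channel-wise Gram matrix of a (batch, h, w, c) feature map."""
    b, h, w, c = tf.unstack(tf.shape(feat))
    flat = tf.reshape(feat, [b, h * w, c])
    gram = tf.matmul(flat, flat, transpose_a=True)
    return gram / tf.cast(h * w * c, tf.float32)

# Feature extractor built from VGG-16 pretrained on ImageNet.
# Layer names and loss weights below are illustrative choices; VGG input
# preprocessing is omitted for brevity.
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
content_layer = "block3_conv3"
style_layers = ["block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3"]
extractor = tf.keras.Model(
    vgg.input,
    {name: vgg.get_layer(name).output for name in [content_layer] + style_layers})

def perceptual_loss(stylized, content, style, lc=1.0, ls=5.0, ltv=1e-4):
    """Weighted sum of feature, style, and total-variation terms (Eq. 1)."""
    f_hat, f_c, f_s = extractor(stylized), extractor(content), extractor(style)
    feat_loss = tf.reduce_mean(tf.square(f_hat[content_layer] - f_c[content_layer]))
    style_loss = tf.add_n([
        tf.reduce_mean(tf.square(gram_matrix(f_hat[l]) - gram_matrix(f_s[l])))
        for l in style_layers])
    tv_loss = tf.reduce_sum(tf.image.total_variation(stylized))
    return lc * feat_loss + ls * style_loss + ltv * tv_loss
```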

Batch Normalization
Because the parameters of the previous layers change during training, the distribution of each layer's inputs also changes. This makes deep neural networks difficult to train.
To solve this problem, Ioffe et al. proposed batch normalization [8]. By using batch normalization, the internal covariate shift is reduced and the training process of the network is greatly shortened. Batch normalization is implemented by normalizing each feature map so that its mean is zero and its variance is one. For a layer with $d$-dimensional input $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension is normalized as

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}} \tag{2}$$

where the expectation and variance are computed over the training data set.
In this paper, batch normalization is added after the output of each convolutional and deconvolutional layer.
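The helper functions below sketch this placement using tf.keras layers; filter counts, kernel sizes and strides are left as arguments and are not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size, strides=1):
    """Convolution followed by batch normalization and ReLU, mirroring the
    placement of BN after every convolution layer described above."""
    x = layers.Conv2D(filters, kernel_size, strides=strides, padding="same")(x)
    x = layers.BatchNormalization()(x)  # zero mean, unit variance per feature map
    return layers.ReLU()(x)

def deconv_bn_relu(x, filters, kernel_size, strides=2):
    """Transposed convolution (upsampling) followed by BN and ReLU."""
    x = layers.Conv2DTranspose(filters, kernel_size, strides=strides, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```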

Residual Connection
He et al. found that the use of residual connections [9] in a network makes it possible to train very deep convolutional neural networks. They applied residual connections on various datasets and demonstrated their effectiveness. The residual block is defined as

$$y = F(x, \{W_i\}) + x \tag{3}$$

where $x$ and $y$ are the input and output vectors of the layers, and the function $F(x, \{W_i\})$ is the residual mapping to be learned. The structure of the residual block is shown in Fig. 2.
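A minimal tf.keras sketch of such a residual block is shown below. The two 3x3 convolutions with batch normalization follow the description given later for the Fusion module; the filter count is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """y = F(x, {W_i}) + x (Eq. 3): two 3x3 convolutions, each followed by
    batch normalization, with a ReLU between them and an identity shortcut."""
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([shortcut, y])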

Image Transformation Network
A module we designed with fewer parameters is proposed in this section. First, we introduce the design ideas behind modules with fewer parameters. Then, we propose a module called Fusion that enables us to build an image transformation network.

Design Idea
Our objective is to define a CNN module with fewer parameters while keeping the image transformation network's ability to generate images of similar quality. To achieve this, the following ideas are used in the design of the module. First, use 1x1 filters: given a certain number of filters, the module uses more 1x1 filters, because a 1x1 filter has one ninth of the parameters of a 3x3 filter.
Second, reduce the number of input channels to the 3x3 filters: a convolution layer with 3x3 filters has (number of input channels) x (number of filters) x (3 x 3) parameters, so its parameter count can be effectively reduced by lowering both the number of input channels and the number of filters; a quick parameter-count example is sketched below. These ideas are applied to obtain a novel module and to use it in the main body of the image transformation network, whose first layers are convolution layers used for downsampling and whose last layers are deconvolution layers used for upsampling.
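A quick back-of-the-envelope check of these two ideas, using illustrative channel counts rather than the paper's configuration:

```python
def conv_params(in_channels, num_filters, kernel_size):
    """Weight count of a convolution layer (biases ignored)."""
    return in_channels * num_filters * kernel_size * kernel_size

# Illustrative channel counts (not the paper's exact configuration):
print(conv_params(128, 128, 3))  # 3x3 layer: 147,456 weights
print(conv_params(128, 128, 1))  # 1x1 layer:  16,384 weights (9x fewer)
print(conv_params(32, 128, 3))   # 3x3 layer with reduced input channels: 36,864
```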

The Fusion Module
The Fusion module is defined as

$$y = \mathrm{joint}\{\, C_{1\times 1}^{2}(C_{1\times 1}^{1}(x)),\; F(C_{1\times 1}^{1}(x)) + C_{1\times 1}^{1}(x) \,\} \tag{4}$$

where $x$ and $y$ are the module input and output, and $F$ is the residual mapping mentioned above. $C_{1\times 1}^{1}$ denotes the first 1x1 convolution layer, which is used to reduce the number of input channels. The output of the first 1x1 convolution layer is then fed into the second 1x1 convolution layer $C_{1\times 1}^{2}$ and into a residual block, respectively. Finally, the output of the second 1x1 convolution layer and the output of the residual block are joined as the output of the module.
In the Fusion module, the residual block consists of two 3x3 convolution layers. The output of each convolution layer is normalized by batch normalization, and the activation function is ReLU. The Fusion module is illustrated in Fig. 3.
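A possible tf.keras sketch of the Fusion module following Eq. (4) is given below. The squeeze and branch filter counts are illustrative assumptions, and the "joint" operation is interpreted here as channel-wise concatenation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size):
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def fusion_module(x, squeeze_filters, branch_filters):
    """Fusion module (Eq. 4): a 1x1 squeeze convolution, followed by a 1x1
    branch and a residual branch of two 3x3 convolutions, concatenated along
    the channel axis. Filter counts are illustrative."""
    squeezed = conv_bn_relu(x, squeeze_filters, 1)          # first 1x1 conv
    branch_1x1 = conv_bn_relu(squeezed, branch_filters, 1)  # second 1x1 conv
    # Residual branch: two 3x3 convolutions, each with BN and ReLU, plus shortcut.
    r = conv_bn_relu(squeezed, squeeze_filters, 3)
    r = conv_bn_relu(r, squeeze_filters, 3)
    branch_res = layers.Add()([squeezed, r])
    return layers.Concatenate()([branch_1x1, branch_res])
```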

The Transformation Network Architecture
The transformation network architecture is shown in Fig. 4. The image transformation network starts with three convolution layers used for downsampling, followed by five Fusion modules, and ends with three deconvolution layers used for upsampling to generate the final image.
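The overall architecture might be assembled as in the sketch below; the filter counts, strides, and output activation are assumptions rather than the paper's exact values, and `fusion_module` refers to the sketch in the previous section.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_transformation_network(fusion_module):
    """Three downsampling convolutions, five Fusion modules, and three
    upsampling deconvolutions. All hyperparameters here are illustrative."""
    inputs = layers.Input(shape=(None, None, 3))
    x = inputs
    # Downsampling convolutions, each followed by BN and ReLU.
    for filters, stride in [(32, 1), (64, 2), (128, 2)]:
        x = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # Five Fusion modules.
    for _ in range(5):
        x = fusion_module(x, squeeze_filters=32, branch_filters=64)
    # Upsampling deconvolutions.
    for filters in [64, 32]:
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # Final deconvolution maps back to 3 channels; sigmoid keeps pixels in [0, 1].
    x = layers.Conv2DTranspose(3, 3, strides=1, padding="same",
                               activation="sigmoid")(x)
    return tf.keras.Model(inputs, x)
```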
In Fig. 5, it is obvious that the number of parameters in each layer of our network is much lower than in the network of Johnson et al. Compared with their network, our network's parameter count is reduced by 62.3%.

Experiments
We use the perceptual loss function as the objective function to train the image transformation network. The Microsoft COCO dataset [10] is used to train our image transformation network. Our implementation uses TensorFlow [11] and cuDNN [12]. The whole training process is carried out on a single GTX 950M GPU. We have also reimplemented the work of Johnson et al. As shown in Fig. 6, our image transformation network generates images of similar quality to those of Johnson et al.
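A minimal training-loop sketch is shown below, assuming `network` is the transformation network built above, `perceptual_loss` is the loss sketched earlier, `style_image` is a pre-loaded style target, and `coco_dataset` yields batches of content images; the optimizer and learning rate are illustrative, not the paper's settings.

```python
import tensorflow as tf

# Optimizer and learning rate are illustrative assumptions.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(network, content_batch, style_image):
    with tf.GradientTape() as tape:
        stylized = network(content_batch, training=True)
        loss = perceptual_loss(stylized, content_batch, style_image)
    grads = tape.gradient(loss, network.trainable_variables)
    optimizer.apply_gradients(zip(grads, network.trainable_variables))
    return loss

# Hypothetical usage, one style per trained network:
# for content_batch in coco_dataset:
#     loss = train_step(network, content_batch, style_image)
```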
In Table 1, we compare the runtime of the two networks on images of different sizes. For every image size, our network is faster; compared to the network of Johnson et al., it is 12% faster on average. The experiments show that our network can better meet real-time requirements.
Since the image transformation network is fully convolutional, it can be applied to images of any resolution as long as the machine has enough memory.

Conclusion
In this paper, the Fusion module is proposed based on several design ideas, and an image transformation network is constructed with this module to obtain a network with fewer parameters and better real-time performance. Experimental results show that, compared with the network of Johnson et al., the number of parameters in our network is greatly reduced and the runtime is shortened while images of similar quality are generated.
In the future, we would like to explore better network architectures for image transformation tasks. We hope that an image transformation network with fewer parameters and better real-time performance can be applied to smartphones, partly freeing such applications from the constraints of mobile phone hardware.

Fig. 5. Parameters for each layer of the network. (a) The network of Johnson et al. (b) Our network.

Fig. 6. (a) The content images. (b) The style images. (c) Results of Johnson et al. (d) Our results.

Table 1. For different sizes of images, the table lists the speed of our network and the speed of the network of Johnson et al. Our network produces stylized images of similar quality but is faster than the network of Johnson et al. Both methods are run on a GTX 950M GPU.