Diversity Regularized Adversarial Deep Learning

The two key players in Generative Adversarial Networks (GANs), the discriminator and generator, are usually parameterized as deep neural networks (DNNs). On many generative tasks, GANs achieve state-of-the-art performance but are often unstable to train and sometimes miss modes. A typical failure mode is the collapse of the generator to a single parameter configuration where its outputs are identical. When this collapse occurs, the gradient of the discriminator may point in similar directions for many similar points. We hypothesize that some of these shortcomings are in part due to primitive and redundant features extracted by the discriminator, which can easily stall training. We present a novel approach for regularizing adversarial models by enforcing diverse feature learning. To do this, both generator and discriminator are regularized by penalizing both negatively and positively correlated features according to their differentiation and based on their relative cosine distances. In addition to the gradient information from the adversarial loss made available by the discriminator, diversity regularization also ensures that a more stable gradient is provided to update both the generator and discriminator. Results indicate that our regularizer enforces diverse features, stabilizes training, and improves image synthesis.


Introduction
Convolutional neural networks (CNNs) have become the powerhouse for tackling many image processing and computer vision tasks. By design, CNNs learn to automatically optimize a well-defined objective function that quantifies the quality of results and their performance on the task at hand. As shown in previous studies [1], designing effective loss functions for many image prediction problems is daunting and often requires manual effort and in-depth expert knowledge and insight. For instance, naively minimizing the Euclidean distance between predicted and ground-truth pixels has been shown to result in blurry outputs, since the Euclidean distance is minimized by averaging all conceivable outputs [1][2][3][4]. One plausible way of training models with high-level objective specifications is to allow CNNs to automatically learn the appropriate loss functions that satisfy these desired objectives. One such objective could be as simple as asking the model to make the output indistinguishable from the ground truth.
As established in [1,5,6,7], GANs are trained to automatically learn an objective function using a discriminator network to classify whether its input is real or synthesized while simultaneously training a generative model to minimize the loss. In the GAN framework, both the discriminator and generator aim to minimize their own loss, and the solution to the game is the Nash equilibrium where neither player can independently improve its individual loss [5,8]. This framework can also be interpreted from the viewpoint of statistical divergence minimization between the learned model distribution and the true data distribution [9][10][11].
Even though GANs have resulted in new and interesting applications and achieved promising performance, they are still hard to train and very sensitive to hyperparameter tuning. A peculiar and common training challenge is controlling the performance of the discriminator. The discriminator is usually inaccurate and unstable in estimating the density ratio in high-dimensional spaces, which leads to situations where the generator finds it difficult to model the multi-modal landscape of the true data distribution. When the supports of the model and true distributions are completely disjoint, a discriminator can trivially distinguish between the model distribution and that of the true data [12], which leads to situations where the generator stops training because the derivative of the resulting discriminator with respect to the input has vanished. This problem has prompted many recent works to propose workable heuristics that address training problems such as mode collapse and missing modes.
We argue in line with the hypothesis that some of the problems associated with the training of GANs are in part due to a lack of control of the discriminator. In light of this, we propose a simple yet powerful diversity regularizer for training GANs that encourages the discriminator to extract near-orthogonal filters. The problem abstraction is that, in addition to the gradient information from the adversarial loss made available by the discriminator, we also want the GAN system to benefit from extracting diverse features in the discriminator. Experimental results consistently show that, when correctly applied, the proposed regularization enforces diverse features in the discriminator and better stabilizes GAN training, with mostly positive effects on the generated samples.
The contribution of this work is two-fold: (i) we propose a new method to regularize adversarial learning by inhibiting the learning of redundant features and providing a stable gradient for weight updates during training, and (ii) we show that the proposed method stabilizes adversarial training and enhances the performance of many state-of-the-art methods across many benchmark datasets. The rest of the paper is structured as follows: Section II highlights the state of the art and Section III discusses in detail the formulation of diversity-regularized adversarial learning. Section IV discusses the detailed experimental designs and presents the results. Finally, conclusions are drawn in Section V.

Related Work
As originally introduced in [5], GANs consist of a generator and a discriminator that are parameterized by deep neural networks and are capable of synthesizing interesting local structure on select datasets. The representation capacity of the original GAN was extended in conditional GANs [13] by incorporating an additional vector that enables the generator to synthesize samples conditioned on some useful information. This extension has motivated several conditional variants of GAN in diverse applications such as edge maps [14,15], image synthesis from text [16], super-resolution [17], and style transfer [18], to mention just a few. Learning useful representations with GANs has been shown to rely heavily on hyperparameter tuning due to various instability issues during training [8,12,19]. GANs are remarkably hard to train in spite of their success on a variety of tasks. Efforts to robustly and systematically stabilize the training of GANs have come in many forms, such as selective architectural design [6], matching of intermediate features [7], and unrolling the optimization of the discriminator [20]. Many recent advances, inspired by either theoretical insights or practical considerations, have taken the form of regularization and normalization to address some of the issues associated with training GANs. Imposing a Lipschitz constraint on the discriminator has been shown to stabilize adversarial training and avoid an over-optimization scenario where the discriminator still distinguishes and allots different scores to nearly indistinguishable samples [12]. By satisfying the Lipschitz constraint, the discriminator's joint/compressed representation of the true and synthesized data distributions is guaranteed to be smooth, thus ensuring a non-zero learning signal for the generator [12,19]. Enforcing the Lipschitz constraint on the discriminator has been approximated and implemented via ancillary means such as gradient penalties [21] and weight clipping [12]. Using a Gaussian classifier over the real/fake indicator variables has also been shown to have a smoothing effect on the discriminator function [19]. Injecting label noise [7] and gradient penalty have equally been shown to have a tremendous regularizing effect on GANs. Schemes such as weighted gradient [22] and missing modes penalty [23] have been utilized to alleviate some training and missing-mode issues in GAN learning.
Weight vectors of the discriminator have been $\ell_2$-normalized with the Frobenius norm, which constrains the sum of the squared singular values of the weight matrix to be 1 [7]. However, normalizing with the Frobenius norm translates to utilizing a single feature to discriminate the model probability distribution from the target, thus reducing the rank and hence the number of discriminator features [24]. Like weight clipping [10,12], weight normalization approaches yield a primitive discriminator model that maps the target distribution with only a select few features. The work most closely related to ours is orthonormal regularization of weights [25], which sets all the singular values of the weight matrix in the discriminator to one; this translates to using as many features as possible to distinguish the generator distribution from the target distribution. Our approach, however, imposes a much softer orthogonality constraint on the weight vectors by allowing a degree of feature sharing in the upper layers of the discriminator. Another related work is spectral normalization of weights, which guarantees 1-Lipschitzness for linear layers and ReLU activation units, resulting in discriminators of higher rank [24]. The advantage of spectral normalization is that weight matrices are constrained and Lipschitz. However, bounding the spectral norm of the convolutional kernel to 1 does not bound the spectral norm of the convolutional mapping to unity.
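The difference between these normalization schemes can be summarized in terms of singular values. The identities below are standard linear algebra; the symbol $\sigma_i(W)$ for the $i$-th singular value of a weight matrix $W$ is introduced here only for illustration.

$$
\|W\|_F^2 = \sum_i \sigma_i(W)^2, \qquad \|W\|_2 = \max_i \sigma_i(W)
$$

Frobenius normalization therefore fixes $\sum_i \sigma_i(W)^2 = 1$, a condition that a single dominant singular value can satisfy on its own. Spectral normalization fixes only $\max_i \sigma_i(W) = 1$ and leaves the remaining singular values free, whereas orthonormal regularization drives every $\sigma_i(W)$ towards 1.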

Method
The training of a GAN can be abstracted as a non-cooperative game between two players, namely the generator $G$ and the discriminator $D$. The discriminator tries to determine whether a sample comes from the real data distribution ($p_{data}$) or is a fake sample generated from noise $z \sim p_z$, while $G$ tries to trick $D$ into believing that the generated sample is from $p_{data}$ by moving the generation manifold towards the data manifold. The discriminator aims to maximize $\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]$ when the input is sampled from the real distribution; given a fake image sample $G(z)$, $z \sim p_z(z)$, it is trained to output a probability $D(G(z))$ close to zero by maximizing $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. The generator network, however, is trained to maximize the chances of $D$ producing a high probability for a fake image sample $G(z)$, that is, by minimizing $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. The adversarial cost is obtained by combining the objectives of both $D$ and $G$ in a min-max game as given in (1):
$$\min_G \max_D J_{adv}(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$
Training $D$ can be conceived as training an evaluation metric on the sample space [23] that enables $G$ to use the local gradient information $\nabla \log D(G(z))$ made available by $D$ to improve itself and move closer to the data manifold.
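To make the two objectives concrete, the following minimal sketch evaluates them on small batches of discriminator outputs. It assumes $D$ returns probabilities strictly between 0 and 1; the function names and the sample values are hypothetical and are not taken from the DiReAL implementation.

```python
import math


def discriminator_objective(d_real, d_fake):
    """Quantity D maximizes: mean log D(x) + mean log(1 - D(G(z)))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_objective(d_fake):
    """Quantity G minimizes: mean log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probabilities of "real").
d_real = [0.9, 0.8, 0.95]
d_fake_weak = [0.1, 0.2, 0.05]   # D rejects the generated samples easily
d_fake_strong = [0.6, 0.7, 0.5]  # G fools D more often

print(discriminator_objective(d_real, d_fake_weak))
print(generator_objective(d_fake_weak), generator_objective(d_fake_strong))
```

The generator objective decreases as the discriminator assigns higher probabilities to generated samples, while the discriminator objective is largest when real and generated samples are scored far apart.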

Feature diversification in GAN
Both D and G are commonly parameterized as DNNs, and over the past few years the general trend has been that DNNs have grown deeper, amounting to a huge increase in the number of parameters. The number of parameters in DNNs is usually very large, offering the possibility of learning very flexible, high-performing models [26]. Observations from many previous studies [27][28][29][30] suggest that layers of DNNs typically rely on many redundant filters that can be either shifted versions of each other or very similar, with little or no variation. For instance, this redundancy is evidently pronounced in the filters of AlexNet [31], as emphasized in [28,32,33]. To address this redundancy problem, we train layers of the discriminator under specific and well-defined diversity constraints.
Since G and D rely on many redundant filters, we regularize them during training to provide a more stable gradient to update both G and D. Our regularizer enforces constraints on the learning process by encouraging diverse filtering and discouraging D from extracting redundant filters. We remark that convolutional filtering has been found to benefit greatly from diversity or orthogonality of filters because it can alleviate problems of vanishing or exploding gradients [25,34,35,36].
Typically, both $D$ and $G$ consist of input, output, and many intermediate processing layers. Let the number of channels, height, and width of the input feature map for the $l$th layer be denoted as $n_l$, $h_l$, and $w_l$, respectively. A convolutional layer in $D$ or $G$ transforms input $x_l \in \mathbb{R}^p$ into output $x_{l+1} \in \mathbb{R}^q$, where $x_{l+1}$ is the input to layer $l+1$; $p$ and $q$ are given as $n_l \times h_l \times w_l$ and $n_{l+1} \times h_{l+1} \times w_{l+1}$, respectively. $x_l$ is convolved with $n_{l+1}$ 3D filters $\chi \in \mathbb{R}^{n_l \times k \times k}$, resulting in $n_{l+1}$ output feature maps. Unrolling and combining all filters of layer $l$ into a single matrix results in a kernel matrix whose $i$-th column $\theta^D_i \in \mathbb{R}^m$ holds the unrolled weights of the $i$-th filter, so that $m = n_l k^2$; the bias term of each layer is omitted for simplicity.
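The unrolling step can be sketched as follows. This is a dependency-free illustration using nested Python lists rather than framework tensors; the helper names `unroll_filters` and `normalize_columns` and the toy filter values are assumptions made for this example. Each filter of shape $n_l \times k \times k$ becomes one column of length $m = n_l k^2$, and the columns are scaled to unit $\ell_2$ norm so that inner products between them are cosine similarities.

```python
import math


def unroll_filters(filters):
    """Flatten each n_in x k x k filter into one column of length n_in * k * k."""
    return [[w for channel in f for row in channel for w in row] for f in filters]


def normalize_columns(columns, eps=1e-12):
    """Scale every column to unit l2 norm (eps guards against all-zero filters)."""
    normalized = []
    for col in columns:
        norm = math.sqrt(sum(w * w for w in col))
        normalized.append([w / (norm + eps) for w in col])
    return normalized


# Toy layer: 2 input channels, k = 2, and 3 filters.
filters = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]],
    [[[2.0, 0.0], [0.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]]],  # scaled copy of filter 0
    [[[0.0, 1.0], [-1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]],
]

theta = normalize_columns(unroll_filters(filters))
print(len(theta), len(theta[0]))  # 3 columns, each with m = 8 entries
```

After normalization the first two columns coincide, which is exactly the kind of redundancy a diversity penalty is meant to expose.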
Given that $\Theta_D^{(l)} \in \mathbb{R}^{m \times n_l}$ contains $n_l$ normalized filter vectors as columns, each with $m$ elements corresponding to connections from layer $l-1$ to the $i$th neuron of layer $l$, the diversity loss $J_D$ over all layers of $D$ is defined in terms of $\Omega_D^{(l)} = (\Theta_D^{(l)})^\top \Theta_D^{(l)} \in \mathbb{R}^{n_l \times n_l}$, which contains the inner products of each pair of columns $i$ and $j$ of $\Theta_D^{(l)}$ in layer $l$, and $M_D^{(l)} \in \mathbb{R}^{n_l \times n_l}$, a binary mask for layer $l$ defined in (5); $L$ is the number of layers to be regularized. The diversity loss $J_G$ for the generator $G$ is defined similarly. To enforce feature diversity in both $G$ and $D$ while training GANs, the diversity regularization terms in (4) are added to the conventional adversarial cost $J_{adv}$ in (1) as given in (6), where $\lambda_G$ and $\lambda_D$ are the diversity penalty factors for the generator and discriminator, respectively.
The derivatives of $J_D$ and $J_G$ with respect to the weights of $D$ and $G$, respectively, supply the diversity gradients. The idea behind diversifying features is that, in addition to the adversarial gradient information provided by $D$, the diversity loss supplies a more stable gradient to refine both $G$ and $D$. The diversity loss encourages the weights of both generator and discriminator to be diverse by pushing them towards the nearest orthogonal manifold. Our proposed regularization provides more efficient gradient flow, a more stable optimization, richer layer-wise features in the resulting model, and improved sample quality compared to benchmarks and the baseline. The diversity regularization ensures that the column space of $\Theta_G^{(l)}$ for the $l$th layer does not concentrate in a few directions during training, thus preventing the layer from being sensitive in only a few limited directions. The proposed diversity-regularized adversarial learning alleviates some of the main failure modes of GANs by ensuring that features are diverse.
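As an illustration of a masked pairwise-correlation penalty of this kind, the sketch below computes a diversity loss for one layer from unit-norm filter columns. The exact functional form is an assumption made for this example: the loss is taken as half the sum of squared off-diagonal cosine similarities whose magnitude is at least a threshold `tau`, with the binary mask selecting those pairs. The function names are hypothetical, and plain lists are used instead of PyTorch tensors to keep the example self-contained.

```python
def gram(columns):
    """Omega[i][j] = inner product of unit-norm columns i and j (cosine similarity)."""
    n = len(columns)
    return [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(n)]
        for i in range(n)
    ]


def diversity_loss(columns, tau=0.5):
    """Assumed form: 0.5 * sum of Omega[i][j]**2 over pairs i != j with |Omega[i][j]| >= tau."""
    omega = gram(columns)
    n = len(columns)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            # Binary mask: ignore the diagonal and weakly correlated pairs.
            if i != j and abs(omega[i][j]) >= tau:
                loss += 0.5 * omega[i][j] ** 2
    return loss


print(diversity_loss([[1.0, 0.0], [0.0, 1.0]]))   # orthogonal filters: 0.0
print(diversity_loss([[1.0, 0.0], [1.0, 0.0]]))   # duplicated filter: 1.0
print(diversity_loss([[1.0, 0.0], [-1.0, 0.0]]))  # negatively correlated: 1.0
```

Because the similarity is squared, positively and negatively correlated filter pairs are penalized alike, while pairs below the threshold are left free to share features.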

Experiments
All experiments were performed on an Intel Core i7-6700 CPU @ 3.40 GHz with 64 GB of RAM running 64-bit Ubuntu 16.04. The software was implemented with the PyTorch library on two Titan X 12 GB GPUs. The implementation of DiReAL is available at https://github.com/keishinkickback/DiReAL. Diversity regularized adversarial learning (DiReAL) was evaluated on the MNIST dataset of handwritten digits [37] and the CIFAR-10 [38], STL-10 [39], and Celeb-A [40] databases.
In the first set of experiments, the ubiquitous deep convolutional GAN (DCGAN) of [6] was trained using MNIST digits. The standard MNIST dataset has 60000 training and 10000 testing examples. Each example is a grayscale image of a handwritten digit scaled and centered in a 28 × 28 pixel box. Both the discriminator and generator networks contain 5 layers of convolutional blocks. The Adam optimizer [41] ($\beta_1 = 0.0$, $\beta_2 = 0.9$) with a batch size of 64 was used to train the model for 100 epochs; $\tau$ and the learning rate in DiReAL were set to 0.5 and 0.0001, respectively. Similarly, $\lambda_D$ and $\lambda_G$ were set to 1.0 and 0.01, respectively. Fig. 2 shows the diversity loss of both generator and discriminator for DiReAL and its unregularized counterpart. It can be observed that DiReAL was able to minimize the pairwise feature correlations compared to the highly correlated features extracted by the unregularized counterpart. Specifically, DiReAL steadily minimized the diversity loss as training progressed, whereas in the unregularized DCGAN the extraction of similar features grew with the training epochs, increasing the diversity loss. The divergence between the discriminator output for real handwritten digits and generated samples over 30 batches for the regularized and unregularized networks is shown in Fig. 3a. The divergence was measured using the Wasserstein distance [42], and it can be observed that the regularizing effect of DiReAL stabilizes the adversarial training and prevents mode collapse. For the unregularized network, however, the modes started to collapse around the 45th epoch. A closer look at the diversity of the generator in Fig. 2a shows that, just around the epoch of collapse, the generator starts extracting more and more redundant filters. We suspect that DiReAL was able to stabilize the training by pushing features to lie close to the orthogonal manifold, thus preventing learned features from collapsing to an undesirable manifold. Fig. 3b shows the handwritten digit samples synthesized with and without DiReAL, and it can be observed that diversification of features is beneficial for stabilizing adversarial learning and ultimately improving sample quality. Another observation is that DiReAL also prevents learned weights from collapsing to an undesirable manifold, highlighting some of the benefits of pushing weights near the orthogonal manifold.
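For one-dimensional samples such as scalar discriminator outputs, the empirical Wasserstein-1 distance has a simple closed form. The sketch below is one way such a divergence could be computed and is not necessarily the procedure used to produce Fig. 3a; it assumes two batches of equal size, and the function name and score values are hypothetical.

```python
def wasserstein_1d(a, b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size samples, the optimal transport plan pairs the sorted values,
    so the distance is the mean absolute difference of the order statistics.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical discriminator scores for one batch of real and generated digits.
real_scores = [0.9, 0.7, 0.8]
fake_scores = [0.2, 0.1, 0.3]

print(wasserstein_1d(real_scores, fake_scores))
print(wasserstein_1d(real_scores, real_scores))  # identical samples: 0.0
```

A distance that stays bounded across batches suggests the discriminator is not separating real and generated samples trivially, which is the behaviour the stability comparison looks for.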
In the second set of large-scale experiments, the CIFAR-10 dataset was used to train a GAN using DiReAL, and the results were compared to unregularized training. The dataset is split into 50000 training and 10000 testing images. Similar to the experiments with MNIST, Fig. 4b shows the diversity loss of the discriminator with and without DiReAL trained on the CIFAR-10 database. It can be observed that DiReAL was able to minimize the diversity loss and encourage diverse features that benefit the adversarial training. On the other hand, Fig. 4b shows that the diversity loss of the unregularized network is higher and unconstrained compared to that of DiReAL. The images synthesized with DiReAL were compared and contrasted with those of state-of-the-art methods such as batch normalization [43], layer normalization [44], weight normalization [45], and spectral normalization [24]. It is remarked that DiReAL can be used in tandem with the other regularization techniques and could also be deployed as a stand-alone regularization tool for stabilizing adversarial learning. In this light, DiReAL was also combined with these techniques. It must be noted that spectral normalization uses a variant of the DCGAN architecture with an eight-layer discriminator network. See [24] for more implementation details.
Method                         Inception Score
Real data                      9.04
-Standard CNN-
Unregularized [6]              4.00 ± 0.15
DiReAL (ours)                  4.17 ± 0.03
Batch Normalization [43]       5.48 ± 0.19
Layer Normalization [44]       5.05 ± 0.12
Weight Normalization [45]      4.66 ± 0.14
Spectral Normalization [24]    6.50 ± 0.
It can be observed in Fig. 5 that diversity regularization was able to synthesize more diverse and complex images compared to the unregularized counterpart. Other benchmark regularizers were able to generate better image samples than DiReAL alone. However, when DiReAL was combined with other regularizers, the quality of the generated samples was significantly improved. For quantitative evaluation of generated examples, the inception score metric [45] was used. The inception score has been found to correlate highly with subjective human judgment of image quality [24,45]. Similar to [24,45], the inception score was computed for 5000 synthesized images using generators trained with each regularization technique. Each experiment was repeated five times and the scores averaged to reduce the effect of random initialization. The average and the standard deviation of the inception scores are reported.
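The inception score is the exponential of the average KL divergence between each image's predicted class distribution $p(y|x)$ and the marginal class distribution $p(y)$. The sketch below applies that standard definition to toy probability vectors; in practice the probabilities come from a pretrained Inception classifier, which is omitted here, and the function name and `eps` smoothing constant are assumptions of this example.

```python
import math


def inception_score(probs, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y)) over all samples.

    probs: list of per-image class-probability vectors, each summing to 1.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y): average prediction over all images.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(
            pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
            for c, pc in enumerate(p)
        )
    return math.exp(kl_sum / n)


# Confident predictions spread over both classes give a score near the class count.
diverse = [[1.0, 0.0], [0.0, 1.0]]
# Every image mapped to the same class (mode collapse) gives the minimum score of 1.
collapsed = [[1.0, 0.0], [1.0, 0.0]]

print(round(inception_score(diverse), 6), round(inception_score(collapsed), 6))
```

The score rewards samples that are individually recognizable (peaked $p(y|x)$) and collectively varied (flat $p(y)$), which is why a collapsed generator scores poorly.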
The proposed regularization is also compared and contrasted in terms of inception score with many benchmark methods, as summarized in Table 1. It can again be observed that DiReAL improved the image generation quality compared to the unregularized counterpart, and when it was combined with spectral normalization we observed a 6% improvement in the inception score. By combining DiReAL with layer normalization, an improvement of 11.68% in the inception score was observed. However, no significant improvement was observed when DiReAL was combined with batch normalization or weight normalization. It must be remarked that the calculation of inception scores is library dependent, which is why the scores reported in Table 1 differ from those reported by Miyato et al. [24]. While our implementation was in PyTorch, that of [24] was in Chainer.
In the next set of large-scale experiments, the STL-10 dataset was used to train the generator under diversity regularization, and the results were compared with other state-of-the-art regularization techniques. As can be observed in Fig. 6, the generator trained with DiReAL was able to synthesize images of competitive quality in comparison with the other regularization methods considered. The performance of DiReAL was also observed to be competitive with regularization methods such as WGAN-GP and spectral normalization. In Fig. 7 we show the images produced by the generators trained with DiReAL using the Celeb-A dataset. It can again be observed that DiReAL was able to stabilize the training and avoid mode collapse in comparison to the unregularized counterpart.

Conclusion
This paper proposes an effective method of stabilizing the training of GANs using diversity regularization, which penalizes both negatively and positively correlated features according to their differentiation and based on their relative cosine distances. It has been shown that diversity regularization can help alleviate a common failure mode where the generator collapses to a single parameter configuration and outputs identical points. This has been achieved by providing stable diversity gradient information, in addition to the adversarial gradient information, to update the features of both the generator and the discriminator. The performance of the proposed regularization in terms of extracting diverse features and improving adversarial learning was compared, on the basis of image synthesis, with recent regularization techniques, namely batch normalization, layer normalization, weight normalization, weight clipping, WGAN-GP, and spectral normalization. It has also been shown on select examples that extraction of diverse features improves the quality of image generation, especially when used in combination with spectral normalization. This concept is illustrated using the MNIST handwritten digits, CIFAR-10, STL-10, and Celeb-A datasets.

Fig. 2. Diversity loss of (a) generator JG with no regularization, (b) generator JG with DiReAL, (c) discriminator JD with no regularization, and (d) discriminator JD with DiReAL trained on MNIST dataset.

Fig. 3. (a) Divergence, as measured by Wasserstein distance, between the discriminator output for synthesized and real MNIST samples (b) Synthesized hand-written digits with and without diversity regularization.

Fig. 6. Qualitative comparison of generated images with four regularization techniques for models trained on STL-10 dataset.

Fig. 7. Generated images with and without diversity regularization trained on Celeb-A dataset.

Table 1. Inception scores for unsupervised image generation on CIFAR-10.