Improved Training of Wasserstein GANs: Translation (Part 2)
Improved Training of Wasserstein GANs
Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but can still sometimes generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of the gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators. We also achieve high-quality generations on CIFAR-10 and LSUN bedrooms. †
1 Introduction
Generative Adversarial Networks (GANs) [9] are a powerful class of generative models that cast generative modeling as a game between two networks: a generator network produces synthetic data given some noise source and a discriminator network discriminates between the generator’s output and true data. GANs can produce very visually appealing samples, but are often hard to train, and much of the recent work on the subject [23, 19, 2, 21] has been devoted to finding ways of stabilizing training. Despite this, consistently stable training of GANs remains an open problem.
In particular, [1] provides an analysis of the convergence properties of the value function being optimized by GANs. Their proposed alternative, named Wasserstein GAN (WGAN) [2], leverages the Wasserstein distance to produce a value function which has better theoretical properties than the original. WGAN requires that the discriminator (called the critic in that work) lie within the space of 1-Lipschitz functions, which the authors enforce through weight clipping.
Our contributions are as follows:
1. On toy datasets, we demonstrate how critic weight clipping can lead to undesired behavior.
2. We propose gradient penalty (WGAN-GP), which does not suffer from the same problems.
3. We demonstrate stable training of varied GAN architectures, performance improvements over weight clipping, high-quality image generation, and a character-level GAN language model without any discrete sampling.
∗Now at Google Brain
†Code for our models is available at
2 Background
2.1 Generative adversarial networks
The GAN training strategy is to define a game between two competing networks. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator.
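Formally, the game between the generator G and the discriminator D is the minimax objective:
$$\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(1 - D(\tilde{x}))]$$
where $\mathbb{P}_r$ is the data distribution and $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$ (the input z to the generator is sampled from some simple noise distribution p, such as the uniform distribution or a spherical Gaussian distribution).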
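As a minimal sketch of what this objective measures, the snippet below estimates the standard GAN value function, E[log D(x)] + E[log(1 - D(G(z)))], from a handful of discriminator outputs. The function name `gan_value` and the sample outputs are hypothetical and chosen only for illustration.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell the two apart outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator scores real samples high and fake samples low.
confident = gan_value([0.9, 0.95, 0.99], [0.05, 0.1])
```

The discriminator tries to push this value up (the confident case scores higher than the confused one), while the generator tries to push it down by making its samples harder to tell apart from real data.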
If the discriminator is trained to optimality before each generator parameter update, then minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data distribution $\mathbb{P}_r$ and the model distribution $\mathbb{P}_g$.
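The link to the Jensen-Shannon divergence can be checked numerically on small discrete distributions. The sketch below assumes the standard result that the optimal discriminator is D*(x) = p_r(x) / (p_r(x) + p_g(x)), in which case the value function equals -log 4 + 2 JSD(p_r, p_g). The two example distributions are made up for illustration.

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions given as lists (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p, q):
    """GAN value function evaluated at D*(x) = p(x) / (p(x) + q(x))."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(pi / (pi + qi))
        if qi > 0:
            total += qi * math.log(qi / (pi + qi))
    return total

p_r = [0.5, 0.3, 0.2]   # hypothetical data distribution
p_g = [0.2, 0.2, 0.6]   # hypothetical model distribution

lhs = value_at_optimal_d(p_r, p_g)
rhs = -math.log(4) + 2 * jsd(p_r, p_g)
```

Here `lhs` and `rhs` agree up to floating-point error, and when the two distributions coincide the value reduces to -log 4, its minimum over generators.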
2.2 Wasserstein GANs
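[2] argues that the divergences which GANs typically minimize are potentially not continuous with respect to the generator’s parameters, leading to training difficulty. They propose instead using the Earth-Mover (also called Wasserstein-1) distance $W(\mathbb{P}_r, \mathbb{P}_g)$ between the data distribution $\mathbb{P}_r$ and the model distribution $\mathbb{P}_g$.
The WGAN value function is constructed using the Kantorovich-Rubinstein duality [25] to obtain
$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})]$$
where $\mathcal{D}$ is the set of 1-Lipschitz functions and $\tilde{x} = G(z)$ is a generated sample.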
The WGAN value function results in a critic function whose gradient with respect to its input is better behaved than its GAN counterpart, making optimization of the generator easier. Empirically, it was also observed that the WGAN value function appears to correlate with sample quality, which is not the case for GANs [2].
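A one-dimensional toy case illustrates the duality behind the WGAN value function. The sketch below relies on a standard fact that is not stated in the text: for two equal-size empirical distributions on the real line, the Wasserstein-1 distance is the mean absolute difference between sorted samples. The helper names and sample values are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical distributions on the real line."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def critic_gap(f, real, fake):
    """E_real[f(x)] - E_fake[f(x)], the quantity a WGAN critic maximizes."""
    return sum(map(f, real)) / len(real) - sum(map(f, fake)) / len(fake)

fake = [0.0, 1.0, 2.5, 4.0]
real = [x + 3.0 for x in fake]   # real data is the fake data shifted by 3

w1 = wasserstein_1d(real, fake)

# Both critics below are 1-Lipschitz; only the first attains the distance.
gap_identity = critic_gap(lambda x: x, real, fake)
gap_half = critic_gap(lambda x: 0.5 * x, real, fake)
```

In this example the critic f(x) = x recovers the full distance of 3, while the flatter critic f(x) = 0.5x gives only a lower bound, which is the sense in which the maximum over 1-Lipschitz critics equals the Wasserstein-1 distance.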
To enforce the Lipschitz constraint on the critic, [2] propose to clip the weights of the critic to lie within a compact space [−c, c]. The set of functions satisfying this constraint is a subset of the k-Lipschitz functions for some k which depends on c and the critic architecture. In the following sections, we demonstrate some of the issues with this approach and propose an alternative.
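Weight clipping itself is a one-line operation. The sketch below clamps a list of weights into [-c, c] and, for the simplest possible critic (a single linear map from x to w · x, an assumption made only for this illustration), shows how the resulting Lipschitz constant is bounded by a k that depends on both c and the layer size. The threshold and weights are hypothetical values.

```python
import math

def clip_weights(weights, c):
    """Clamp every weight into the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

def linear_lipschitz_bound(n_inputs, c):
    """Upper bound on the L2 Lipschitz constant of x -> w . x when |w_i| <= c.

    The Lipschitz constant of a linear map is the L2 norm of w, which is at
    most c * sqrt(n_inputs) once every entry is clipped.
    """
    return c * math.sqrt(n_inputs)

c = 0.01  # hypothetical clipping threshold
clipped = clip_weights([0.5, -0.02, 0.003, -0.004], c)

# Actual Lipschitz constant of the clipped linear critic.
l2_norm = math.sqrt(sum(w * w for w in clipped))
```

For deeper critics the bound compounds across layers, which is why k depends on the architecture as well as on c.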
2.3 Properties of the optimal WGAN critic
In order to understand why weight clipping is problematic in a WGAN critic, as well as to motivate our approach, we highlight some properties of the optimal critic in the WGAN framework. We prove these in the Appendix.
Proposition 1. Let
3 Difficulties with weight constraints
We find that weight clipping in WGAN leads to optimization difficulties, and that even when optimization succeeds, the resulting critic can have a pathological value surface. We explain these problems below and demonstrate their effects; however, we do not claim that each one always occurs in practice, nor that they are the only such mechanisms.
Our experiments use the specific form of weight constraint from [2] (hard clipping of the magnitude of each weight), but we also tried other weight constraints (L2 norm clipping, weight normalization), as well as soft constraints (L1 and L2 weight decay) and found that they exhibit similar problems.
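For readers unfamiliar with the alternatives listed above, the sketch below shows textbook versions of L2 norm clipping and of L1 and L2 weight decay applied to a flat list of weights. These are generic formulations, not necessarily the exact variants used in the experiments, and weight normalization is left out. The learning rate and coefficients are hypothetical.

```python
import math

def l2_norm_clip(weights, max_norm):
    """Rescale the weight vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    return [w * max_norm / norm for w in weights]

def l2_decay_step(weights, lr, coeff):
    """One gradient step on the soft penalty (coeff / 2) * sum(w^2)."""
    return [w - lr * coeff * w for w in weights]

def l1_decay_step(weights, lr, coeff):
    """One subgradient step on the soft penalty coeff * sum(|w|)."""
    sign = lambda w: (w > 0) - (w < 0)
    return [w - lr * coeff * sign(w) for w in weights]

clipped = l2_norm_clip([3.0, -4.0], max_norm=1.0)      # norm 5 rescaled to norm 1
after_l2 = l2_decay_step([1.0, -2.0], lr=0.1, coeff=0.5)
after_l1 = l1_decay_step([1.0, -2.0, 0.0], lr=0.1, coeff=0.5)
```

All three act on the weights alone, independently of how the critic behaves on its inputs, which is the property they share with hard clipping.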
To some extent these problems can be mitigated with batch normalization in the critic, which [2] use in all of their experiments. However, even with batch normalization, we observe that very deep WGAN critics often fail to converge.
[Figure 1 panels: 8 Gaussians, 25 Gaussians, Swiss Roll]
(a) Value surfaces of WGAN critics trained to optimality on toy datasets using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution. The ‘generator’ is held fixed at the real data plus Gaussian noise.
(b) (left) Gradient norms of deep WGAN critics during training on the Swiss Roll dataset either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).
Figure 1: Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping.
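The idea of penalizing the norm of the critic's gradient with respect to its input can be sketched in a few lines. The version below is an illustration under several assumptions that this section does not spell out: the penalty is two-sided with a target norm of 1, it is evaluated at random interpolations between paired real and generated points, and the coefficient of 10 is an arbitrary choice. Gradients are taken by finite differences so that the example needs no autodiff library, and the toy critics are hypothetical.

```python
import math
import random

def grad_wrt_input(critic, x, eps=1e-6):
    """Central finite-difference gradient of a scalar critic at point x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((critic(hi) - critic(lo)) / (2 * eps))
    return grad

def gradient_penalty(critic, real, fake, rng, coeff=10.0):
    """coeff * mean((||grad critic(x_hat)||_2 - 1)^2) over interpolated points."""
    total = 0.0
    for x, y in zip(real, fake):
        t = rng.random()  # assumed: uniform interpolation weight in [0, 1)
        x_hat = [t * a + (1 - t) * b for a, b in zip(x, y)]
        norm = math.sqrt(sum(g * g for g in grad_wrt_input(critic, x_hat)))
        total += (norm - 1.0) ** 2
    return coeff * total / len(real)

rng = random.Random(0)
real = [[1.0, 2.0], [0.5, -1.0]]
fake = [[0.0, 0.0], [2.0, 1.0]]

unit_slope = lambda x: 0.6 * x[0] + 0.8 * x[1]   # gradient norm exactly 1
steep = lambda x: 3.0 * x[0] + 4.0 * x[1]        # gradient norm 5

gp_unit = gradient_penalty(unit_slope, real, fake, rng)
gp_steep = gradient_penalty(steep, real, fake, rng)
```

The critic with unit gradient norm incurs essentially no penalty, while the steep critic is penalized heavily. Unlike clipping, this term constrains the critic through its behavior on inputs and does not touch the weights directly.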
3.1 Capacity underuse
Implementing a k-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in Corollary 1, the optimal WGAN critic has unit gradient norm almost everywhere under the data distribution and the generator distribution. Figure 1a shows value surfaces of critics trained to optimality on several toy distributions, with the generator held fixed at the real data plus Gaussian noise. In each case, the critic trained with weight clipping ignores higher moments of the data distribution and instead models very simple approximations to the optimal functions. In contrast, our approach does not suffer from this behavior.
§This assumption is in order to exclude the case when the matching point of sample x is x itself. It is satisfied in the case that
3.2 Exploding and vanishing gradients
We observe that the WGAN optimization process is difficult because of interactions between the weight constraint and the cost function, which result in either vanishing or exploding gradients without careful tuning of the clipping threshold c.
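A deliberately simplified model shows why the clipping threshold is so sensitive. Assume a deep linear critic whose layers are square matrices with every weight sitting at the clipping extreme +c (a caricature of weights piling up at the extremes of the clipping range). Each layer then has spectral norm c × width, and the gain of the backpropagated gradient is that quantity raised to the depth. The width, depth, and thresholds below are hypothetical.

```python
def backprop_gain(c, width, depth):
    """Spectral norm of the input-output Jacobian of a deep linear critic
    whose `depth` layers are width x width matrices with every entry equal to c.

    Each layer has spectral norm c * width, and because all layers share the
    same top singular direction, the norms multiply exactly.
    """
    return (c * width) ** depth

width, depth = 512, 12            # hypothetical critic size
for c in (0.1, 0.01, 0.001):      # hypothetical clipping thresholds
    print(f"c={c}: gain={backprop_gain(c, width, depth):.3e}")
```

In this toy model the gain grows or shrinks geometrically with depth unless c × width happens to be close to 1, so small changes in c move the critic between exploding and vanishing gradients. Real critics have nonlinearities and mixed-sign weights, so this is only a rough picture of the mechanism.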
To demonstrate this, we train WGAN on the Swiss Roll toy dataset, varying the clipping threshold c in [
Editor: Lornatang
Proofreader: Lornatang