Improved Training of Wasserstein GANs
Paper: http://arxiv.org/pdf/1704.00028v3.pdf
Abstract
Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms. †
1 Introduction
Generative Adversarial Networks (GANs) [9] are a powerful class of generative models that cast generative modeling as a game between two networks: a generator network produces synthetic data given some noise source and a discriminator network discriminates between the generator’s output and true data. GANs can produce very visually appealing samples, but are often hard to train, and much of the recent work on the subject [23, 19, 2, 21] has been devoted to finding ways of stabilizing training. Despite this, consistently stable training of GANs remains an open problem.
In particular, [1] provides an analysis of the convergence properties of the value function being optimized by GANs. Their proposed alternative, named Wasserstein GAN (WGAN) [2], leverages the Wasserstein distance to produce a value function which has better theoretical properties than the original. WGAN requires that the discriminator (called the critic in that work) must lie within the space of 1-Lipschitz functions, which the authors enforce through weight clipping.
Our contributions are as follows:
1. On toy datasets, we demonstrate how critic weight clipping can lead to undesired behavior.
2. We propose gradient penalty (WGAN-GP), which does not suffer from the same problems (a minimal sketch of the penalty follows this list).
3. We demonstrate stable training of varied GAN architectures, performance improvements over weight clipping, high-quality image generation, and a character-level GAN language model without any discrete sampling.
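For concreteness, below is a minimal PyTorch sketch of the gradient penalty named in contribution 2. The specific form, a penalty lam * (||grad D(x_hat)||_2 - 1)^2 evaluated at random interpolates between real and generated samples, follows the full objective developed later in the paper; the framework choice, function name, and arguments are illustrative rather than the released implementation.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Penalize the norm of the critic's gradient with respect to its input,
    evaluated at random interpolates x_hat between real and generated samples:
    lam * E[(||grad_{x_hat} D(x_hat)||_2 - 1)^2]."""
    batch_size = real.size(0)
    # One interpolation coefficient per sample, broadcast over remaining dims.
    eps = torch.rand(batch_size, *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)

    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```

The critic loss is then the WGAN critic loss plus this penalty term.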
∗Now at Google Brain
†Code for our models is available at https://github.com/igul222/improved_wgan_training.
2 Background
2.1 Generative adversarial networks
The GAN training strategy is to define a game between two competing networks. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator.
Formally, the game between the generator G and the discriminator D is the minimax objective

\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(1 - D(\tilde{x}))]

where \mathbb{P}_r is the data distribution and \mathbb{P}_g is the model distribution implicitly defined by \tilde{x} = G(z), z \sim p(z) (the input z to the generator is sampled from some simple noise distribution p, such as the uniform distribution or a spherical Gaussian distribution).
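As an illustration, a minimal PyTorch sketch of this objective follows; the framework, the module names (generator, discriminator), and the use of logits-based binary cross-entropy are assumptions made for clarity, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def gan_losses(discriminator, generator, real_x, z):
    """Minimax GAN losses written with logits-based BCE: the discriminator
    ascends E[log D(x)] + E[log(1 - D(G(z)))], while the generator uses the
    common non-saturating surrogate (ascend E[log D(G(z))])."""
    fake_x = generator(z)                                    # x_tilde = G(z)

    # Discriminator step: real samples should score 1, generated samples 0.
    d_real = discriminator(real_x)
    d_fake = discriminator(fake_x.detach())                  # do not backprop into G
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator step: fool the discriminator into scoring generated samples as 1.
    d_fake_for_g = discriminator(fake_x)
    g_loss = F.binary_cross_entropy_with_logits(d_fake_for_g,
                                                torch.ones_like(d_fake_for_g))
    return d_loss, g_loss
```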
If the discriminator is trained to optimality before each generator parameter update, then minimizing the value function amounts to minimizing the Jensen-Shannon divergence between \mathbb{P}_r and \mathbb{P}_g, but doing so often leads to vanishing gradients as the discriminator saturates.

2.2 Wasserstein GANs
[2] argues that the divergences which GANs typically minimize are potentially not continuous with respect to the generator's parameters, leading to training difficulty. They propose instead using the Earth-Mover (also called Wasserstein-1) distance W(q, p), informally defined as the minimum cost of transporting mass in order to transform the distribution q into the distribution p. The WGAN value function is constructed using the Kantorovich-Rubinstein duality [25] to obtain

\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})]

where \mathcal{D} is the set of 1-Lipschitz functions and \mathbb{P}_g is once again the model distribution implicitly defined by \tilde{x} = G(z), z \sim p(z). Under an optimal critic, minimizing the value function with respect to the generator parameters minimizes W(\mathbb{P}_r, \mathbb{P}_g).
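A minimal PyTorch sketch of this value function, in the same illustrative style as above (module names are placeholders):

```python
import torch

def wgan_losses(critic, generator, real_x, z):
    """Critic ascends E_{x~P_r}[D(x)] - E_{x_tilde~P_g}[D(x_tilde)] (written
    here as a loss to minimize); the generator minimizes -E[D(G(z))]."""
    fake_x = generator(z)
    critic_loss = -(critic(real_x).mean() - critic(fake_x.detach()).mean())
    generator_loss = -critic(fake_x).mean()
    return critic_loss, generator_loss
```

The 1-Lipschitz constraint on the critic is enforced separately: by weight clipping in [2], or by the gradient penalty proposed here.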
The WGAN value function results in a critic function whose gradient with respect to its input is better behaved than its GAN counterpart, making optimization of the generator easier. Empirically, it was also observed that the WGAN value function appears to correlate with sample quality, which is not the case for GANs [2].
To enforce the Lipschitz constraint on the critic, [2] propose to clip the weights of the critic to lie within a compact space [-c, c]. The set of functions satisfying this constraint is a subset of the k-Lipschitz functions for some k which depends on c and the critic architecture. In the following sections, we demonstrate some of the issues with this approach and propose an alternative.
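Concretely, weight clipping amounts to clamping every critic parameter after each critic update. A sketch, assuming a torch.nn.Module critic; c = 0.01 is the threshold used in [2]:

```python
import torch

def clip_critic_weights(critic: torch.nn.Module, c: float = 0.01) -> None:
    """Clamp every critic parameter to the compact interval [-c, c];
    called after each critic optimizer step."""
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```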
2.3 Properties of the optimal WGAN critic
In order to understand why weight clipping is problematic in a WGAN critic, as well as to motivate our approach, we highlight some properties of the optimal critic in the WGAN framework. We prove these in the Appendix.
Proposition 1. Let \mathbb{P}_r and \mathbb{P}_g be two distributions in \mathcal{X}, a compact metric space, and let f^* be an optimal critic, i.e. a 1-Lipschitz solution of \max_{\|f\|_L \le 1} \mathbb{E}_{y \sim \mathbb{P}_r}[f(y)] - \mathbb{E}_{y \sim \mathbb{P}_g}[f(y)]. Let \pi be the optimal coupling between \mathbb{P}_r and \mathbb{P}_g. Then, if f^* is differentiable, \pi(x = y) = 0,§ and x_t = t x + (1 - t) y with 0 \le t \le 1, it holds that \mathbb{P}_{(x, y) \sim \pi}\big[\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|}\big] = 1.

Corollary 1. f^* has gradient norm 1 almost everywhere under \mathbb{P}_r and \mathbb{P}_g.

3 Difficulties with weight constraints
We find that weight clipping in WGAN leads to optimization difficulties, and that even when optimization succeeds the resulting critic can have a pathological value surface. We explain these problems below and demonstrate their effects; however we do not claim that each one always occurs in practice, nor that they are the only such mechanisms.
Our experiments use the specific form of weight constraint from [2] (hard clipping of the magnitude of each weight), but we also tried other weight constraints (L2 norm clipping, weight normalization), as well as soft constraints (L1 and L2 weight decay) and found that they exhibit similar problems.
To some extent these problems can be mitigated with batch normalization in the critic, which [2] use in all of their experiments. However even with batch normalization, we observe that very deep WGAN critics often fail to converge.
Figure 1: Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping. (a) Value surfaces of WGAN critics trained to optimality on toy datasets (8 Gaussians, 25 Gaussians, Swiss Roll) using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution. The 'generator' is held fixed at the real data plus Gaussian noise. (b) (left) Gradient norms of deep WGAN critics during training on the Swiss Roll dataset either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).
3.1 Capacity underuse
Implementing a k-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in Corollary 1, the optimal WGAN critic has unit gradient norm almost everywhere under \mathbb{P}_r and \mathbb{P}_g; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm k end up learning extremely simple functions.

To demonstrate this, we train WGAN critics with weight clipping to optimality on several toy distributions, holding the generator distribution \mathbb{P}_g fixed at the real distribution plus unit-variance Gaussian noise, and plot value surfaces of the critics in Figure 1a. We omit batch normalization in the critic. In each case, the critic trained with weight clipping ignores higher moments of the data distribution and instead models very simple approximations to the optimal functions. In contrast, our approach does not suffer from this behavior.

§ This assumption is in order to exclude the case when the matching point of sample x is x itself. It is satisfied in the case that \mathbb{P}_r and \mathbb{P}_g have supports which intersect in a set of measure 0, such as when they are supported by two low-dimensional manifolds that don't perfectly align [1].
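A sketch of this toy setup follows, under the assumptions of a small 2-D critic module and a user-supplied sampler for one of the toy distributions; the optimizer, learning rate, and step count are illustrative, not the paper's settings.

```python
import torch

def train_critic_to_optimality(critic, sample_real, n_steps=10_000, lr=1e-4,
                               clip=None):
    """Train a WGAN critic on a toy 2-D distribution with the generator held
    fixed at the real data plus unit-variance Gaussian noise. If `clip` is set,
    apply weight clipping after each step (as in the top row of Figure 1a)."""
    opt = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(n_steps):
        real = sample_real()                      # e.g. a batch from 8 Gaussians
        fake = real + torch.randn_like(real)      # fixed "generator" distribution
        loss = -(critic(real).mean() - critic(fake).mean())
        opt.zero_grad()
        loss.backward()
        opt.step()
        if clip is not None:
            with torch.no_grad():
                for p in critic.parameters():
                    p.clamp_(-clip, clip)
    return critic
```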
3.2 Exploding and vanishing gradients
We observe that the WGAN optimization process is difficult because of interactions between the weight constraint and the cost function, which result in either vanishing or exploding gradients without careful tuning of the clipping threshold c.
To demonstrate this, we train WGAN on the Swiss Roll toy dataset, varying the clipping threshold c in [10^{-1}, 10^{-2}, 10^{-3}], and plot the norm of the gradient of the critic loss with respect to successive layers of activations (Figure 1b, left).
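A sketch of the diagnostic behind this experiment: after backpropagating the critic loss, record the gradient norm at each layer and check whether it grows or decays with depth. The sketch records gradients of the weight matrices of an assumed MLP critic; the exact per-layer quantity plotted in Figure 1b may differ.

```python
import torch

def layer_gradient_norms(critic: torch.nn.Module, critic_loss: torch.Tensor):
    """Return the L2 norm of the gradient of `critic_loss` with respect to each
    weight matrix of the critic, ordered from input to output layer."""
    critic.zero_grad()
    critic_loss.backward(retain_graph=True)
    return [p.grad.norm().item()
            for name, p in critic.named_parameters()
            if name.endswith("weight")]
```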