
Improved Training of Wasserstein GANs


Paper: http://arxiv.org/pdf/1704.00028v3.pdf

Abstract


Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only poor samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models with continuous generators. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms. †


1 Introduction


Generative Adversarial Networks (GANs) [9] are a powerful class of generative models that cast generative modeling as a game between two networks: a generator network produces synthetic data given some noise source and a discriminator network discriminates between the generator’s output and true data. GANs can produce very visually appealing samples, but are often hard to train, and much of the recent work on the subject [23, 19, 2, 21] has been devoted to finding ways of stabilizing training. Despite this, consistently stable training of GANs remains an open problem.


In particular, [1] provides an analysis of the convergence properties of the value function being optimized by GANs. Their proposed alternative, named Wasserstein GAN (WGAN) [2], leverages the Wasserstein distance to produce a value function which has better theoretical properties than the original. WGAN requires that the discriminator (called the critic in that work) must lie within the space of 1-Lipschitz functions, which the authors enforce through weight clipping.


Our contributions are as follows:


1. On toy datasets, we demonstrate how critic weight clipping can lead to undesired behavior.


2. We propose gradient penalty (WGAN-GP), which does not suffer from the same problems.


3. We demonstrate stable training of varied GAN architectures, performance improvements over weight clipping, high-quality image generation, and a character-level GAN language model without any discrete sampling.


∗Now at Google Brain


†Code for our models is available at https://github.com/igul222/improved_wgan_training.


2 Background


2.1 Generative adversarial networks


The GAN training strategy is to define a game between two competing networks. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator.


Formally, the game between the generator G and the discriminator D is the minimax objective:


$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim \mathbb{P}_r}\big[\log(D(x))\big] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[\log(1 - D(\tilde{x}))\big]$$

where $\mathbb{P}_r$ is the data distribution and $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$ (the input $z$ to the generator is sampled from some simple noise distribution $p$, such as the uniform distribution or a spherical Gaussian distribution).


If the discriminator is trained to optimality before each generator parameter update, then minimizing the value function amounts to minimizing the Jensen-Shannon divergence between $\mathbb{P}_r$ and $\mathbb{P}_g$ [9], but doing so often leads to vanishing gradients as the discriminator saturates. In practice, [9] advocates that the generator be instead trained to maximize $\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(D(\tilde{x}))]$, which goes some way to circumvent this difficulty. However, even this modified loss function can misbehave in the presence of a good discriminator [1].

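To make these objectives concrete, here is a minimal PyTorch-style sketch of the discriminator loss and of both generator losses discussed above (the saturating minimax form and the non-saturating form advocated by [9]). This is only an illustration, not code from the paper or its repository; `generator` and `discriminator` are placeholder networks, and the discriminator is assumed to end in a sigmoid so its output is a probability.

```python
import torch

def gan_losses(generator, discriminator, real, z):
    # `generator` maps noise z to samples; `discriminator` maps samples
    # to a probability in (0, 1), i.e. it is assumed to end in a sigmoid.
    fake = generator(z)

    # Discriminator: maximize E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. minimize the negative of that sum.
    d_loss = -(torch.log(discriminator(real)).mean()
               + torch.log(1.0 - discriminator(fake.detach())).mean())

    # Generator, original minimax form: minimize E[log(1 - D(G(z)))].
    g_loss_minimax = torch.log(1.0 - discriminator(fake)).mean()

    # Generator, non-saturating form from [9]: maximize E[log D(G(z))],
    # i.e. minimize -E[log D(G(z))]; its gradients vanish less when D is strong.
    g_loss_nonsaturating = -torch.log(discriminator(fake)).mean()

    return d_loss, g_loss_minimax, g_loss_nonsaturating
```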

2.2 Wasserstein GANs


[2] argues that the divergences which GANs typically minimize are potentially not continuous with respect to the generator's parameters, leading to training difficulty. They propose instead using the Earth-Mover (also called Wasserstein-1) distance $W(q, p)$, which is informally defined as the minimum cost of transporting mass in order to transform the distribution $q$ into the distribution $p$ (where the cost is mass times transport distance). Under mild assumptions, $W(q, p)$ is continuous everywhere and differentiable almost everywhere.


The WGAN value function is constructed using the Kantorovich-Rubinstein duality [25] to obtain


$$\min_{G} \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}\big[D(x)\big] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[D(\tilde{x})\big]$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions and $\mathbb{P}_g$ is once again the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$. In that case, under an optimal discriminator (called a critic in the paper, since it's not trained to classify), minimizing the value function with respect to the generator parameters minimizes $W(\mathbb{P}_r, \mathbb{P}_g)$.


The WGAN value function results in a critic function whose gradient with respect to its input is better behaved than its GAN counterpart, making optimization of the generator easier. Empirically, it was also observed that the WGAN value function appears to correlate with sample quality, which is not the case for GANs [2].

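For comparison with the GAN losses sketched above, the WGAN value function translates into the following critic and generator losses. This is again a hedged PyTorch-style sketch rather than the authors' TensorFlow code; the critic outputs an unconstrained real-valued score, and the 1-Lipschitz constraint must still be enforced separately, e.g. by the weight clipping of [2] or by the gradient penalty this paper proposes.

```python
import torch

def wgan_losses(generator, critic, real, z):
    # `critic` outputs an unconstrained scalar score per sample.
    fake = generator(z)

    # Critic: maximize E_{P_r}[D(x)] - E_{P_g}[D(x_tilde)],
    # i.e. minimize the negated difference.
    critic_loss = -(critic(real).mean() - critic(fake.detach()).mean())

    # Generator: minimize -E_{P_g}[D(G(z))]; for a (near-)optimal critic
    # this decreases the estimated Wasserstein distance W(P_r, P_g).
    generator_loss = -critic(fake).mean()

    return critic_loss, generator_loss
```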

To enforce the Lipschitz constraint on the critic, [2] propose to clip the weights of the critic to lie within a compact space $[-c, c]$. The set of functions satisfying this constraint is a subset of the $k$-Lipschitz functions for some $k$ which depends on $c$ and the critic architecture. In the following sections, we demonstrate some of the issues with this approach and propose an alternative.

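The clipping step of [2] is simple to sketch: after each critic update, force every critic weight back into the compact space $[-c, c]$. The snippet below is an illustrative PyTorch version, not the original implementation; `c = 0.01` is the default reported in the WGAN paper [2] and is only an example value here.

```python
import torch

def clip_critic_weights(critic, c=0.01):
    # Hard-clip every critic parameter into [-c, c], the weight constraint
    # [2] uses to (crudely) enforce a k-Lipschitz critic.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)

# Sketch of one critic step:
#   critic_optimizer.zero_grad()
#   critic_loss.backward()
#   critic_optimizer.step()
#   clip_critic_weights(critic, c=0.01)
```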

2.3 Properties of the optimal WGAN critic


In order to understand why weight clipping is problematic in a WGAN critic, as well as to motivate our approach, we highlight some properties of the optimal critic in the WGAN framework. We prove these in the Appendix.


Proposition 1. Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions in $\mathcal{X}$, a compact metric space. Then, there is a 1-Lipschitz function $f^*$ which is the optimal solution of $\max_{\|f\|_L \le 1} \mathbb{E}_{y \sim \mathbb{P}_r}[f(y)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f(x)]$. Let $\pi$ be the optimal coupling between $\mathbb{P}_r$ and $\mathbb{P}_g$, defined as the minimizer of

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\pi \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \pi}\big[\|x - y\|\big],$$

where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ is the set of joint distributions $\pi(x, y)$ whose marginals are $\mathbb{P}_r$ and $\mathbb{P}_g$, respectively. Then, if $f^*$ is differentiable‡, $\pi(x = y) = 0$§, and $x_t = t x + (1 - t) y$ with $0 \le t \le 1$, it holds that

$$\mathbb{P}_{(x, y) \sim \pi}\left[\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|}\right] = 1.$$

Corollary 1. $f^*$ has gradient norm 1 almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$.


3 Difficulties with weight constraints


We find that weight clipping in WGAN leads to optimization difficulties, and that even when optimization succeeds the resulting critic can have a pathological value surface. We explain these problems below and demonstrate their effects; however we do not claim that each one always occurs in practice, nor that they are the only such mechanisms.


Our experiments use the specific form of weight constraint from [2] (hard clipping of the magnitude of each weight), but we also tried other weight constraints (L2 norm clipping, weight normalization), as well as soft constraints (L1 and L2 weight decay) and found that they exhibit similar problems.


To some extent these problems can be mitigated with batch normalization in the critic, which [2] use in all of their experiments. However even with batch normalization, we observe that very deep WGAN critics often fail to converge.


[Figure 1 image panels omitted. Panel titles: 8 Gaussians, 25 Gaussians, Swiss Roll.]

(a) Value surfaces of WGAN critics trained to optimality on toy datasets using (top) weight clipping and (bottom) gradient penalty. Critics trained with weight clipping fail to capture higher moments of the data distribution. The "generator" is held fixed at the real data plus Gaussian noise. (b) (left) Gradient norms of deep WGAN critics during training on the Swiss Roll dataset either explode or vanish when using weight clipping, but not when using a gradient penalty. (right) Weight clipping (top) pushes weights towards two values (the extremes of the clipping range), unlike gradient penalty (bottom).

Figure 1: Gradient penalty in WGANs does not exhibit undesired behavior like weight clipping.


3.1 Capacity underuse


Implementing a $k$-Lipschitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in Corollary 1, the optimal WGAN critic has unit gradient norm almost everywhere under $\mathbb{P}_r$ and $\mathbb{P}_g$; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm $k$ end up learning extremely simple functions.


To demonstrate this, we train WGAN critics with weight clipping to optimality on several toy distributions, holding the generator distribution $\mathbb{P}_g$ fixed at the real distribution plus unit-variance Gaussian noise. We plot value surfaces of the critics in Figure 1a. We omit batch normalization in the critic. In each case, the critic trained with weight clipping ignores higher moments of the data distribution and instead models very simple approximations to the optimal functions. In contrast, our approach does not suffer from this behavior.

‡We can actually assume much less, and talk only about directional derivatives in the direction of the line, which we show in the proof always exist. This would imply that at every point where $f^*$ is differentiable (and thus we can take gradients in a neural network setting) the statement holds.

§This assumption is in order to exclude the case when the matching point of sample $x$ is $x$ itself. It is satisfied in the case that $\mathbb{P}_r$ and $\mathbb{P}_g$ have supports that intersect in a set of measure 0, such as when they are supported by two low-dimensional manifolds that don't perfectly align [1].

3.2 Exploding and vanishing gradients


We observe that the WGAN optimization process is difficult because of interactions between the weight constraint and the cost function, which result in either vanishing or exploding gradients without careful tuning of the clipping threshold c.


To demonstrate this, we train WGAN on the Swiss Roll toy dataset, varying the clipping threshold $c$ in $[10^{-1}, 10^{-2}, 10^{-3}]$, and plot the norm of the gradient of the critic loss with respect to successive layers of activations. Both generator and critic are 12-layer ReLU MLPs without batch normalization. Figure 1b shows that for each of these values, the gradient either grows or decays exponentially as we move farther back in the network. We find our method results in more stable gradients that neither vanish nor explode, allowing training of more complicated networks.
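The measurement behind Figure 1b can be reproduced roughly as follows: keep the critic's intermediate activations and ask autograd for the gradient of the critic loss with respect to each of them. The sketch below is a PyTorch stand-in; the layer width and other details are assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class ReLUMLPCritic(nn.Module):
    """A plain ReLU MLP critic that also returns its hidden activations."""
    def __init__(self, in_dim=2, width=512, n_layers=12):
        super().__init__()
        dims = [in_dim] + [width] * (n_layers - 1)
        self.hidden = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.out = nn.Linear(dims[-1], 1)

    def forward(self, x):
        activations = []
        h = x
        for layer in self.hidden:
            h = torch.relu(layer(h))
            activations.append(h)
        return self.out(h), activations

def layerwise_gradient_norms(critic, real, fake):
    # Norm of the gradient of the critic loss w.r.t. each layer's activations.
    score_real, acts = critic(real)
    score_fake, _ = critic(fake)
    critic_loss = -(score_real.mean() - score_fake.mean())
    grads = torch.autograd.grad(critic_loss, acts)
    return [g.norm().item() for g in grads]
```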

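"Our method" here is the gradient penalty described in the abstract: instead of clipping weights, penalize how far the norm of the critic's gradient with respect to its input deviates from 1, motivated by Corollary 1. The sketch below is illustrative PyTorch rather than the authors' code; evaluating the penalty at random interpolates between real and generated samples, and the coefficient λ = 10, follow the full paper, which lies outside this excerpt.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # Sample points x_hat on straight lines between real and generated data.
    eps_shape = (real.size(0),) + (1,) * (real.dim() - 1)
    eps = torch.rand(eps_shape, device=real.device)
    x_hat = (eps * real.detach() + (1.0 - eps) * fake.detach()).requires_grad_(True)

    # Gradient of the critic's score with respect to its input x_hat.
    scores = critic(x_hat)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]

    # Penalize the squared deviation of each sample's gradient norm from 1
    # (Corollary 1: the optimal critic has gradient norm 1 almost everywhere).
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Sketch of a critic loss with the penalty added:
#   critic_loss = -(critic(real).mean() - critic(fake).mean()) \
#                 + gradient_penalty(critic, real, fake)
```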

Article sourced from http://tongtianta.site/paper/3418
Edited by Lornatang
Proofread by Lornatang
