Transcending Scaling Laws with 0.1% Extra Compute
Oct 2022
Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani
[Google]
https://arxiv.org/abs/2210.11399
Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) for a few more steps with UL2's mixture-of-denoisers objective. We show that, with almost negligible extra computational cost and no new sources of data, we can substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate: U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving ~4.4 million TPUv4 hours). We further show that this improved scaling curve leads to "emergent abilities" on challenging BIG-Bench tasks: for instance, U-PaLM does much better than PaLM on some tasks, or demonstrates better quality at a much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, including English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU, and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single- and multi-span infilling.
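The compute figures in the abstract can be related to each other with a little arithmetic. The sketch below is only a back-of-the-envelope reading, not a calculation from the paper: it treats "approximately 2x" as exactly 2 and applies the 0.1% figure from the title to the implied baseline budget. Both are assumptions, and the helper name `implied_budget` is hypothetical.

```python
def implied_budget(saved_hours: float, speedup: float) -> float:
    """Baseline budget implied by a given saving and speedup factor.

    If the improved model matches the baseline at budget / speedup,
    then saved = budget * (1 - 1 / speedup), which is solved for budget.
    """
    if speedup <= 1.0:
        raise ValueError("speedup must exceed 1 for any compute to be saved")
    return saved_hours / (1.0 - 1.0 / speedup)


SAVED_TPU_HOURS = 4.4e6  # approximate figure quoted in the abstract
SPEEDUP = 2.0            # "approximately 2x", treated as exact (assumption)
EXTRA_FRACTION = 0.001   # 0.1% from the title, applied here (assumption)

baseline = implied_budget(SAVED_TPU_HOURS, SPEEDUP)
extra = baseline * EXTRA_FRACTION

print(f"implied baseline budget: {baseline:,.0f} TPUv4 hours")
print(f"0.1% of that budget:     {extra:,.0f} TPUv4 hours")
```

Under these assumptions the implied baseline budget is about 8.8 million TPUv4 hours, so the extra UL2R training would be on the order of several thousand TPUv4 hours. These are rough consequences of the rounded numbers in the abstract, not reported results.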