What is high-quality data?
Roughly 1 trillion tokens of research papers, plus another 1.8 trillion tokens of books.
“When we talk about high-quality language data, we usually refer to stuff like research papers, where there’s roughly 1 trillion tokens. And then there’s another 1.8 trillion tokens in books,” says Niklas Muennighoff, research engineer at generative AI startup Contextual AI. By that estimate, a model half the size of Gopher would exhaust the stock of high-quality data.
That much data doesn’t call for a very large model; half the size of Gopher is enough. Gopher has 280B parameters.
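The “half of Gopher” figure can be sanity-checked with a rough back-of-the-envelope calculation, assuming the Chinchilla-style rule of thumb of about 20 training tokens per model parameter (that ratio is an assumption here; the source doesn’t state it):

```python
# Rough sanity check of the "half the size of Gopher" claim.
# Assumption: a compute-optimal model wants ~20 training tokens
# per parameter (Chinchilla-style heuristic, not from the source).
HIGH_QUALITY_TOKENS = 1.0e12 + 1.8e12   # papers (1T) + books (1.8T)
TOKENS_PER_PARAM = 20                    # assumed tokens-per-parameter ratio
GOPHER_PARAMS = 280e9                    # Gopher's size: 280B parameters

optimal_params = HIGH_QUALITY_TOKENS / TOKENS_PER_PARAM
print(f"Compute-optimal size: {optimal_params / 1e9:.0f}B parameters")
print(f"Fraction of Gopher:   {optimal_params / GOPHER_PARAMS:.2f}")
# → Compute-optimal size: 140B parameters
# → Fraction of Gopher:   0.50
```

Under this assumption, 2.8 trillion tokens supports a model of about 140B parameters, exactly half of Gopher’s 280B.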
Training on self-generated data
Training a model on its own generated output may degrade its performance.
However, a large model can generate textbook-style data to train a smaller model.
Outside of English, other languages have far less high-quality data available.[1]
A turning point: textbooks are all you need! (June 2023)
Samuel Albanie, assistant professor of applied machine learning at the U.K.’s University of Cambridge, sees a turning point in work Microsoft published in June last year. This paper claimed, in an echo of the title of the 2017 Google paper that introduced Transformers, “textbooks are all you need.”
TinyStories
In fact, a few months earlier (April 2023) there was a related paper, also from Microsoft: it trained language models on children’s stories, using a modest amount of data and small models (under 10M parameters).