In recent years, the focus in language modeling has been on improving performance by increasing the number of parameters in transformer-based models. This approach has led to impressive results and state-of-the-art performance in many natural language processing tasks.
We’ve also followed this line of research at DeepMind and recently introduced Gopher, a 280 billion parameter model that set state-of-the-art performance across a wide range of tasks, including language modeling, reading comprehension, and question answering. Since then, an even larger model, Megatron-Turing NLG, with 530 billion parameters, has been published.
Due to the significant cost of training these large models, it is paramount to estimate the best possible training setup to avoid wasting resources. In particular, the computational cost of training transformers is determined by two factors: the size of the model and the number of training tokens.
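To make those two factors concrete, a common back-of-the-envelope estimate (not given in this post, so treat it as an assumption) is that total training compute is roughly 6 × parameters × tokens. The sketch below applies that estimate to Gopher’s published figures.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate of total training FLOPs: ~6 * parameters * tokens (an assumption)."""
    return 6 * n_params * n_tokens

# Gopher: 280 billion parameters trained on roughly 300 billion tokens.
gopher_budget = training_flops(280e9, 300e9)
print(f"Gopher training budget: ~{gopher_budget:.1e} FLOPs")  # ~5.0e+23
```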
The current generation of large language models has allocated its increased computational resources to growing the number of parameters, while keeping the amount of training data roughly constant at around 300 billion tokens. In this work, we empirically investigate the optimal trade-off between model size and the amount of training data as the computational budget increases. Specifically, we ask: “What is the optimal model size and number of training tokens for a given compute budget?” To answer this question, we train models of varying sizes on varying numbers of tokens and empirically estimate this trade-off.
Our key finding is that current large language models are far too large for their compute budget and are not being trained on enough data. In fact, we find that for the number of training FLOPs used to train Gopher, a 4x smaller model trained on 4x more data would have been preferable.
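Under the same rough 6 × parameters × tokens cost model assumed above, it is easy to check that shrinking the model 4x while training on 4x more data leaves the budget unchanged:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate of total training FLOPs: ~6 * parameters * tokens (an assumption)."""
    return 6 * n_params * n_tokens

# Gopher's setup versus a 4x smaller model trained on 4x more data.
print(f"Gopher (280B params, 300B tokens):     ~{training_flops(280e9, 300e9):.1e} FLOPs")
print(f"Alternative (70B params, 1.2T tokens): ~{training_flops(70e9, 1.2e12):.1e} FLOPs")
# Both come out to ~5.0e+23 FLOPs, because the product of parameters and tokens is unchanged.
```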
We test this hypothesis by training Chinchilla, a 70 billion parameter model trained on 1.4 trillion tokens. While the training compute cost of Chinchilla and Gopher is the same, we find that Chinchilla outperforms Gopher and other large language models on almost every task we measured, despite having 70 billion parameters compared to Gopher’s 280 billion.
Since the release of Chinchilla, a model named PaLM has been published, with 540 billion parameters trained on 768 billion tokens. PaLM was trained with approximately 5 times the compute budget of Chinchilla and outperforms it on a number of tasks. While its training corpus is different, our method predicts that a model of this size trained on our data would outperform Chinchilla despite being compute-suboptimal. Given PaLM’s compute budget, we predict that a 140 billion parameter model trained on 3 trillion tokens would be optimal and more efficient for inference; a rough check of this arithmetic is sketched below.
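As a sanity check on that prediction, one can keep the 6 × parameters × tokens cost model from above and additionally assume that the compute-optimal parameter count and token count each grow roughly with the square root of the budget; both the cost model and the square-root scaling are assumptions here. Anchoring this at Chinchilla’s configuration lands close to the 140 billion parameter, 3 trillion token figures quoted above.

```python
import math

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate of total training FLOPs: ~6 * parameters * tokens (an assumption)."""
    return 6 * n_params * n_tokens

chinchilla_budget = training_flops(70e9, 1.4e12)  # ~5.9e+23 FLOPs
palm_budget = training_flops(540e9, 768e9)        # ~2.5e+24 FLOPs, roughly 4-5x Chinchilla's

# Assumed rule of thumb: optimal parameters and tokens each scale ~sqrt(compute budget).
scale = math.sqrt(palm_budget / chinchilla_budget)
print(f"compute-optimal parameters at PaLM's budget: ~{70e9 * scale / 1e9:.0f}B")    # ~144B
print(f"compute-optimal tokens at PaLM's budget:     ~{1.4e12 * scale / 1e12:.1f}T")  # ~2.9T
```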
An additional benefit of smaller, more efficient models is that inference time and memory costs are reduced, making downstream use both faster and possible on less hardware. In practice, while the training FLOPs of Gopher and Chinchilla are the same, the cost of using Chinchilla is substantially lower, on top of its better performance. Further simple optimizations may be possible that could continue to provide large gains.