In recent years, the focus in language modeling has been on improving performance by increasing the number of parameters in transformer-based models. This approach has led to impressive results and state-of-the-art performance in many natural language processing tasks.
We’ve also followed this line of research at DeepMind and recently introduced Gopher, a 280 billion parameter model that set state-of-the-art performance across a wide range of tasks, including language modeling, reading comprehension, and question answering. Since then, an even larger model, Megatron-Turing NLG, with 530 billion parameters, has been published.
Due to the significant cost of training these large models, it is paramount to estimate the best possible training setup to avoid wasting resources. In particular, the computational cost of training transformers is determined by two factors: the size of the model and the number of training tokens.
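To make those two factors concrete, a common back-of-the-envelope estimate (not given in this post, so treat it as an assumption) is that total training compute is roughly 6 × parameters × tokens. The sketch below applies that estimate to Gopher’s published figures.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate of total training FLOPs: ~6 * parameters * tokens (an assumption)."""
    return 6 * n_params * n_tokens

# Gopher: 280 billion parameters trained on roughly 300 billion tokens.
gopher_budget = training_flops(280e9, 300e9)
print(f"Gopher training budget: ~{gopher_budget:.1e} FLOPs")  # ~5.0e+23
```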
The current generation of large language models has allocated its increased computational resources to growing the number of parameters, while keeping the amount of training data roughly constant at around 300 billion tokens. In this work, we empirically investigate the optimal trade-off between model size and the amount of training data as the computational budget increases. Specifically, we ask: “What is the optimal model size and number of training tokens for a given compute budget?” To answer this question, we train models of varying sizes on varying numbers of tokens and empirically estimate this trade-off.
Our key finding is that current large language models are far too large for their compute budget and are not being trained on enough data. In fact, we find that for the number of training FLOPs used to train Gopher, a 4x smaller model trained on 4x more data would have been preferable.
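Under the same rough 6 × parameters × tokens cost model assumed above, it is easy to check that shrinking the model 4x while training on 4x more data leaves the budget unchanged:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate of total training FLOPs: ~6 * parameters * tokens (an assumption)."""
    return 6 * n_params * n_tokens

# Gopher's setup versus a 4x smaller model trained on 4x more data.
print(f"Gopher (280B params, 300B tokens):     ~{training_flops(280e9, 300e9):.1e} FLOPs")
print(f"Alternative (70B params, 1.2T tokens): ~{training_flops(70e9, 1.2e12):.1e} FLOPs")
# Both come out to ~5.0e+23 FLOPs, because the product of parameters and tokens is unchanged.
```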
We test this hypothesis by training Chinchilla, a 70 billion parameter model trained on 1.4 trillion tokens. While the training compute cost of Chinchilla and Gopher is the same, we find that Chinchilla outperforms Gopher and other large language models on almost every task we measured, despite having 70 billion parameters compared to Gopher’s 280 billion.
Since the release of Chinchilla, a model named PaLM has been published, with 540 billion parameters trained on 768 billion tokens. PaLM was trained with approximately 5 times the compute budget of Chinchilla and outperforms it on a number of tasks. While its training corpus is different, our method predicts that a model of this size trained on our data would outperform Chinchilla despite being compute-suboptimal. Given PaLM’s compute budget, we predict that a 140 billion parameter model trained on 3 trillion tokens would be optimal and more efficient for inference; a rough check of this arithmetic is sketched below.
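As a sanity check on that prediction, one can keep the 6 × parameters × tokens cost model from above and additionally assume that the compute-optimal parameter count and token count each grow roughly with the square root of the budget; both the cost model and the square-root scaling are assumptions here. Anchoring this at Chinchilla’s configuration lands close to the 140 billion parameter, 3 trillion token figures quoted above.

```python
import math

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough estimate of total training FLOPs: ~6 * parameters * tokens (an assumption)."""
    return 6 * n_params * n_tokens

chinchilla_budget = training_flops(70e9, 1.4e12)  # ~5.9e+23 FLOPs
palm_budget = training_flops(540e9, 768e9)        # ~2.5e+24 FLOPs, roughly 4-5x Chinchilla's

# Assumed rule of thumb: optimal parameters and tokens each scale ~sqrt(compute budget).
scale = math.sqrt(palm_budget / chinchilla_budget)
print(f"compute-optimal parameters at PaLM's budget: ~{70e9 * scale / 1e9:.0f}B")    # ~144B
print(f"compute-optimal tokens at PaLM's budget:     ~{1.4e12 * scale / 1e12:.1f}T")  # ~2.9T
```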
An additional benefit of smaller, more efficient models is that inference time and memory costs are reduced, making downstream use both faster and possible on less hardware. In practice, while the training FLOPs of Gopher and Chinchilla are the same, the cost of using Chinchilla is substantially lower, on top of its better performance. Further simple optimizations may be possible that could continue to provide large gains.