Large language model (LLM) training has become increasingly popular over the last year with the release of several publicly available models such as Llama2, Falcon, and StarCoder. Customers are now training LLMs of unprecedented size, ranging from 1 billion to over 175 billion parameters. Training these LLMs requires significant compute resources and time, because hundreds to thousands of graphics processing units (GPUs) must be used to handle today's massive training datasets and model sizes. A bottleneck in distributed training can be GPU communication handled by the NVIDIA Collective Communications Library (NCCL). In some large-scale training jobs, more time can be spent on inter-GPU communication than on actual GPU computation. To alleviate the GPU communication bottleneck and enable faster training, Amazon SageMaker is excited to announce an optimized AllGather collective operation as part of the SageMaker distributed data parallel library (SMDDP). AllGather is the most used collective operation in popular memory-efficient data parallelism solutions such as DeepSpeed Zero Redundancy Optimizer (ZeRO) and Fully Sharded Data Parallelism (FSDP), and it is the main contributor to GPU communication overhead. In this post, we show a high-level overview of how SMDDP works, how you can enable SMDDP in your Amazon SageMaker training scenarios, and the performance improvements you can expect.
Solution overview
Traditional data parallel training involves replicating an entire model across multiple GPUs, with each model replica trained on a different shard of data from the dataset. During the backward pass, gradients are averaged across GPU workers so that each model replica is updated with the same gradient values, even though it was trained on a different data shard. This technique enables much faster training on huge datasets by parallelizing the consumption of the training data. However, some of today's large models (for example, Llama2 70B) are too large to fit entirely in GPU memory, which makes traditional data parallelism unusable. To continue enjoying the benefits of data parallelism while overcoming limited GPU memory, sharded data parallel solutions such as DeepSpeed ZeRO, PyTorch FSDP, and the Amazon SageMaker model parallelism library have grown in popularity.
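For context, the following is a minimal sketch of traditional (non-sharded) data parallelism using PyTorch DistributedDataParallel. The toy model, tensor sizes, and hyperparameters are placeholders, and the script assumes it is launched with torchrun with one process per GPU.

```python
# Minimal sketch of traditional (non-sharded) data parallelism with PyTorch DDP.
# Each rank holds a full model replica; gradients are averaged across ranks
# during backward(). The model and data below are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # launch with torchrun, one process per GPU
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
model = DDP(model, device_ids=[local_rank])  # full replica on every GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(8, 1024, device="cuda")  # each rank sees a different data shard
loss = model(inputs).sum()
loss.backward()   # gradients are AllReduce-averaged across all ranks here
optimizer.step()
```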
In sharded data parallelism, instead of replicating the entire model on GPU workers, the model parameters, gradients, and optimizer states are partitioned (that is, sharded) across the GPUs in the training job. To perform forward and backward pass computation, parameters are gathered from shards on other GPU workers to form one or more model layers. After the computation is performed, these layers are freed from memory to allow the next set of layers to be gathered. Note that there are variants of sharded data parallelism where only the optimizer states and gradients are sharded, but not the model parameters. AllGather is still used in this type of sharded data parallelism, but only prior to forward pass computation, in order to gather model parameters that have been updated by different optimizer state or gradient shards from other GPU workers. See the different DeepSpeed ZeRO stages and the SHARD_GRAD_OP FSDP sharding strategy for more details.
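To make the distinction concrete, here is a small sketch showing how a sharding strategy is selected with PyTorch FSDP. The toy model and the launch assumptions (torchrun, one process per GPU) are illustrative only.

```python
# Sketch: selecting an FSDP sharding strategy (placeholder model, torchrun launch assumed).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model

# FULL_SHARD shards parameters, gradients, and optimizer states and reshards
# parameters after each forward/backward, so AllGather runs in both passes.
# SHARD_GRAD_OP shards gradients and optimizer states but keeps parameters
# unsharded between forward and backward, so AllGather runs only before the forward pass.
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
```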
The AllGather collective operation is run each time parameters are unsharded; NCCL provides the standard open source implementation of this routine. As shown below, each GPU worker participating in AllGather starts with an input buffer and ends up with all of the input buffers from the other workers concatenated together. When AllGather is used in sharded data parallelism, the input buffers contain the model parameter shards and the large output buffers contain one or more model layers materialized from the other shards.
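These buffer semantics can be illustrated with a short torch.distributed snippet. The tensor sizes are arbitrary, and the script assumes a torchrun launch with one process per GPU.

```python
# Illustration of AllGather semantics with torch.distributed (sizes are arbitrary).
# Each rank contributes its local shard; every rank ends up with the concatenation
# of all shards, analogous to materializing a full layer from parameter shards.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # launch with torchrun, one process per GPU
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

shard = torch.full((4,), float(rank), device="cuda")   # this rank's input buffer
gathered = torch.empty(4 * world_size, device="cuda")  # output buffer for all shards
dist.all_gather_into_tensor(gathered, shard)           # concatenates shards from all ranks
print(f"rank {rank}: {gathered.tolist()}")
```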
Although NCCL is commonly used for AllGather in distributed training, its underlying low-level implementation isn't tailored to the networking infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) instances, and thus its performance can slow down end-to-end training. The SMDDP library is a collective communication library for NVIDIA GPUs that serves as a drop-in replacement for NCCL and provides better performance for distributed training jobs with PyTorch. Specifically, SMDDP provides an optimized implementation of AllGather for p4d/p4de instance types.
Because collective operations such as AllGather block forward and backward pass computation, faster execution of these operations directly translates into shorter end-to-end training time with no side effects on convergence. Other, less frequently used collective operations in sharded data parallel training are handled by falling back to NCCL.
AWS-optimized AllGather
AWS-optimized AllGather uses the following techniques to achieve better performance on the AWS infrastructure compared to NCCL:
- We move data between instances over the Elastic Fabric Adapter (EFA) network with an all-to-all communication pattern. EFA is AWS's low-latency, high-throughput network solution, and an all-to-all pattern for inter-node communication is more tailored to the characteristics of EFA and AWS's network infrastructure because it requires fewer packet hops compared to NCCL's ring or tree communication pattern.
- We use GDRCopy to coordinate local NVLink and EFA network traffic. GDRCopy is a library that provides low-latency communication between CPU processes and CUDA kernels running on the GPU. With this technology, we are able to pipeline intra-node and inter-node data transfer.
- We reduce the usage of GPU streaming multiprocessors to give more compute power back to model kernels. AWS P4d/P4de instances are equipped with NVIDIA A100 GPUs, each of which has 108 streaming multiprocessors. While NCCL takes up to 24 streaming multiprocessors to run collectives, SMDDP Collectives use only up to nine. The saved streaming multiprocessors can be picked up by model compute kernels for faster execution.
Usage
SMDDP collectives integrate natively with PyTorch through the process group abstraction in the torch.distributed module. A process group defines the interfaces for common collective operations such as AllGather, ReduceScatter, AllReduce, and so on. Users can write generic distributed code and then choose the underlying backend, which provides the implementation for these operations based on the compute device being used. CPU training jobs often use the gloo or mpi backend, while NVIDIA GPUs use the nccl backend.

The SMDDP library comes into the picture by registering itself as a custom backend in the process group abstraction. This is done by the import statement shown in the following code snippets. Then, when selecting the backend for your GPU-based distributed training job, simply replace nccl with smddp. The smddp backend adheres to the same semantics as the nccl backend and supports the same training scenarios.
DeepSpeed
import smdistributed.dataparallel.torch.torch_smddp  # registers the smddp backend
import deepspeed
deepspeed.init_distributed(dist_backend="smddp")  # replacing "nccl"
FSDP
import smdistributed.dataparallel.torch.torch_smddp  # registers the smddp backend
import torch.distributed as dist
dist.init_process_group(backend="smddp")  # replacing "nccl"
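Putting the pieces together, the following is a minimal sketch of an FSDP training step running on the smddp backend. The placeholder model, tensor sizes, and hyperparameters are assumptions for illustration only, and the script assumes a SageMaker training container with SMDDP installed and a torchrun-style launch.

```python
# Minimal sketch of one FSDP training step using the smddp backend
# (placeholder model, data, and hyperparameters; not a tuned configuration).
import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="smddp")  # instead of "nccl"
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = FSDP(torch.nn.Linear(4096, 4096).cuda())  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inputs = torch.randn(8, 4096, device="cuda")
loss = model(inputs).sum()
loss.backward()   # AllGather/ReduceScatter now run through SMDDP collectives
optimizer.step()
```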
Benchmarks
We compared standalone AllGather performance, where the collective operation is run in isolation without any model training. Below is a sample result on 32 p4d instances comparing NCCL and SMDDP AllGather. The x-axis represents the output size of AllGather, and the y-axis represents the network utilization rate of p4d's 400 Gbps EFA network. The four sub-graphs represent the common communication group patterns where we have 1, 2, 4, and 8 ranks per p4d instance participating in the AllGather operation, respectively.
These microbenchmarks show that SMDDP outperforms NCCL in two key aspects:
- The peak throughput of SMDDP (about 90% bandwidth utilization) is higher than that of NCCL (about 80% bandwidth utilization) in all configurations.
- SMDDP achieves peak throughput at much smaller buffer sizes than NCCL. This particularly improves training speeds for smaller models or when the user sets a small AllGather buffer size in DeepSpeed (where the AllGather size does not need to equal the layer size); see the configuration sketch after this list.
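As an illustration of that buffer-size knob, here is a hedged DeepSpeed ZeRO-3 configuration sketch. The bucket-size value, placeholder model, and optimizer settings are illustrative assumptions rather than tuned recommendations.

```python
# Hedged sketch: bounding the size of DeepSpeed ZeRO-3 parameter AllGathers
# (values and model are illustrative only).
import smdistributed.dataparallel.torch.torch_smddp  # registers the smddp backend
import torch
import deepspeed

deepspeed.init_distributed(dist_backend="smddp")

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        # Bounds how many parameter elements are gathered per AllGather call;
        # smaller buckets mean smaller, more frequent AllGather operations.
        "stage3_prefetch_bucket_size": 5e7,
    },
}

model = torch.nn.Linear(4096, 4096)  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```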
End-to-end training benchmarks
In large-scale training jobs where GPU communication is a significant bottleneck, SMDDP can markedly improve training speeds, as measured by model TFLOPS/GPU.
| Model/Training | Cluster | Sharded data parallelism solution | Model TFLOPS/GPU with NCCL | Model TFLOPS/GPU with SMDDP | % speedup |
| --- | --- | --- | --- | --- | --- |
| 13B Llama2, sequence length 4096, global batch size 4 million tokens | 64 p4d.24xlarge nodes (512 NVIDIA A100 GPUs) | PyTorch FSDP | 97.89 | 121.85 | 24.40% |
| 65B GPT-NeoX, sequence length 2048, global batch size 4 million tokens | 64 p4d.24xlarge nodes (512 NVIDIA A100 GPUs) | DeepSpeed ZeRO Stage 3* | 99.23 | 108.66 | 9.50% |
*EleutherAI’s Megatron-DeepSpeed repository was used. Tensor parallelism was also enabled with a tensor-parallel degree of eight.
Note: Model TFLOPS/GPU is based on the Model FLOPS Utilization calculation defined in the paper here; benchmarks elsewhere may cite hardware TFLOPS/GPU as the performance metric. Hardware TFLOPS/GPU can be approximated as 4/3 x model TFLOPS/GPU.
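As a quick worked example of that conversion, using the 13B Llama2 SMDDP result from the table above:

```python
# Approximate conversion from model TFLOPS/GPU to hardware TFLOPS/GPU,
# using the 4/3 factor noted above (input value taken from the table).
model_tflops_smddp = 121.85                          # 13B Llama2, FSDP with SMDDP
hardware_tflops_smddp = 4 / 3 * model_tflops_smddp
print(round(hardware_tflops_smddp, 1))               # ~162.5 hardware TFLOPS/GPU
```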
Conclusion
In this post, we showed you how to significantly speed up sharded data parallel training jobs in Amazon SageMaker with just two lines of code change. Large-scale distributed training is becoming increasingly ubiquitous with the advent of LLMs, but at this scale it comes at a high cost. By reducing communication bottlenecks between GPUs, SMDDP helps you train faster at scale and save on compute resources. You can find more SMDDP examples with sharded data parallel training in the Amazon SageMaker Examples GitHub repository.
About the Authors
Apoorv Gupta is a Software Development Engineer at AWS, focused on building optimal deep learning systems for AWS infrastructure and hardware. He is interested in distributed computing, deep learning systems and ML accelerators. Outside of work, Apoorv enjoys traveling, hiking and video games.
Karan Dhiman is a Software Development Engineer at AWS, based in Toronto, Canada. He is very passionate about the machine learning space and building solutions to accelerate distributed compute workloads.
Ruhan Prasad is a Software Development Engineer at AWS working to make distributed deep learning training faster, cheaper, and easier to use in SageMaker. Outside of work, Ruhan enjoys playing tennis, traveling and cooking.
Zhaoqi Zhu is a Senior Software Development Engineer at AWS, passionate about distributed systems and low-level optimizations. He likes to watch football games while drinking soda (non-diet).