Choosing Parallelism for Deployment
Both tensor parallelism (TP) and pipeline parallelism (PP) increase the compute and memory capacity available to a model by splitting it across multiple GPUs, but they affect performance differently. Pipeline parallelism is a low-overhead mechanism for efficiently increasing overall throughput, while tensor parallelism is a higher-overhead mechanism for reducing latency.
Tensor and Pipeline Parallelism Explained
Tensor parallelism (TP) splits the execution of each model layer across multiple GPUs. Every calculation is distributed across the available GPUs, and each GPU performs its own portion of the work. The GPUs then combine their individual results, known as partial sums, using an AllReduce operation, which requires communication among all GPUs in the group.
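The partial-sum mechanics above can be sketched in plain Python, using nested lists to stand in for per-GPU shards of a single linear layer (the shapes, split scheme, and helper names here are illustrative, not any specific framework's API):

```python
# Sketch: tensor-parallel execution of one linear layer, Y = X @ W.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]          # activations: 2 x 4
W = [[1, 0], [0, 1],
     [2, 1], [1, 2]]        # layer weights: 4 x 2

num_gpus = 2
k = len(W) // num_gpus      # contraction dimension split across GPUs

# GPU i holds its slice of X's columns and W's rows.
X_shards = [[row[i * k:(i + 1) * k] for row in X] for i in range(num_gpus)]
W_shards = [W[i * k:(i + 1) * k] for i in range(num_gpus)]

# Each "GPU" computes its partial sum independently...
partials = [matmul(x, w) for x, w in zip(X_shards, W_shards)]

# ...then an AllReduce sums the partials, so every GPU ends up
# holding the full layer output.
Y = [[sum(p[r][c] for p in partials) for c in range(len(partials[0][0]))]
     for r in range(len(partials[0]))]

assert Y == matmul(X, W)    # matches unsplit, single-GPU execution
```

The key cost to notice is the AllReduce: it sits on the critical path of every layer, which is why TP's communication overhead grows with GPU count.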
Pipeline parallelism (PP) splits groups of model layers – or stages – across the available GPUs. A request begins on one GPU and continues execution through subsequent stages on subsequent GPUs. With PP, communication occurs only between adjacent stages, rather than between all GPUs as with TP execution.
GPU-to-GPU Bandwidth with and without NVSwitch
Figure: GPU-to-GPU bandwidth with and without NVSwitch. Top: eight GPUs connected through a centralized NVLink Switch. Bottom: eight GPUs connected point-to-point, each with direct links to every other GPU.
NVLink Switch Helps Maximize High-Throughput Performance
Each NVIDIA Hopper architecture GPU incorporates 18 NVLinks, each providing 50 GB/s of bidirectional bandwidth, for a total of 900 GB/s of NVLink bandwidth per GPU. Each HGX H100 8-GPU or HGX H200 server features four NVLink Switches. Without a switch, TP model execution across eight GPUs would require each GPU to communicate with its seven peers over seven equal-bandwidth direct connections, so any single connection would run at just 1/7th of total NVLink bandwidth, or about 128 GB/s. NVLink Switch removes this limit: each GPU connects all of its links to the switch fabric, allowing any pair of GPUs to communicate at the full 900 GB/s.
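The bandwidth arithmetic above is simple enough to check directly (units are GB/s; the variable names are mine):

```python
# Per-GPU NVLink bandwidth on Hopper: 18 links x 50 GB/s bidirectional.
links_per_gpu = 18
bw_per_link = 50                      # GB/s, bidirectional, per NVLink
total_bw = links_per_gpu * bw_per_link
assert total_bw == 900                # 900 GB/s per GPU

# Without a switch, 8 GPUs connect point-to-point: each GPU divides its
# links among its 7 peers, so any single pair gets ~1/7th of the total.
peers = 8 - 1
per_pair_bw = total_bw // peers
assert per_pair_bw == 128             # about 128 GB/s per connection
```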
Choosing Parallelism
Choosing parallelism is about finding the right balance between compute and capacity for the target scenario. NVLink Switch provides developers with the flexibility to select the optimal parallelism configuration leading to better performance than what is possible with either a single GPU, or across multiple GPUs with tensor parallelism alone.
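The balancing act above can be summarized as a toy decision sketch. This heuristic is my own simplification of the article's guidance, not an NVIDIA recipe, and real deployments weigh many more factors (batch size, sequence length, model size, interconnect):

```python
# Toy heuristic: once a model no longer fits on one GPU, pick TP when
# optimizing latency and PP when optimizing throughput.
def choose_parallelism(model_fits_on_one_gpu, optimize_for):
    if model_fits_on_one_gpu:
        return "single GPU"
    if optimize_for == "latency":
        return "tensor parallel"    # splits every layer, cuts per-token time
    return "pipeline parallel"      # low-overhead way to scale throughput

assert choose_parallelism(True, "latency") == "single GPU"
assert choose_parallelism(False, "latency") == "tensor parallel"
assert choose_parallelism(False, "throughput") == "pipeline parallel"
```

In practice the two are often combined, with TP within an NVLink-connected node and PP across nodes, which is why the flexibility NVLink Switch provides matters.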
Conclusion
The NVIDIA platform provides developers with a full technology stack to optimize generative AI inference performance. NVIDIA Hopper architecture GPUs – available from every major cloud and server maker – connected by the high-bandwidth NVLink and NVLink Switch AI fabric, and running TensorRT-LLM software, provide outstanding performance for the latest LLMs.
Frequently Asked Questions
Q: What is tensor parallelism?
A: Tensor parallelism is a technique that splits the execution of each model layer across multiple GPUs, distributing calculations and broadcasting results.
Q: What is pipeline parallelism?
A: Pipeline parallelism is a technique that operates by splitting groups of model layers – or stages – across available GPUs, with communication only occurring between adjacent stages.
Q: What is NVLink Switch?
A: NVLink Switch is a switch chip that connects the NVLink ports of all GPUs in a server, enabling any pair of GPUs to communicate at the full 900 GB/s of per-GPU NVLink bandwidth.
Q: Why is NVLink Switch important?
A: NVLink Switch provides developers with the flexibility to select the optimal parallelism configuration, leading to better performance than what is possible with either a single GPU, or across multiple GPUs with tensor parallelism alone.