Choosing Parallelism for Deployment
Both tensor parallelism (TP) and pipeline parallelism (PP) increase the compute and memory capacity available to a model by splitting it across multiple GPUs, but they affect performance differently. Pipeline parallelism is a low-overhead mechanism for efficiently increasing overall throughput, while tensor parallelism is a higher-overhead mechanism for reducing latency.
Tensor and Pipeline Parallelism Explained
Tensor parallelism (TP) splits the execution of each model layer across multiple GPUs. Every calculation is distributed across the available GPUs, and each GPU performs its own portion of the work. The GPUs then combine their individual results, known as partial sums, using an AllReduce operation, which requires communication among all GPUs in the group.
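The partial-sum mechanics above can be sketched in plain Python, using nested lists to stand in for per-GPU shards of a single linear layer (the shapes, split scheme, and helper names here are illustrative, not any specific framework's API):

```python
# Sketch: tensor-parallel execution of one linear layer, Y = X @ W.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]          # activations: 2 x 4
W = [[1, 0], [0, 1],
     [2, 1], [1, 2]]        # layer weights: 4 x 2

num_gpus = 2
k = len(W) // num_gpus      # contraction dimension split across GPUs

# GPU i holds its slice of X's columns and W's rows.
X_shards = [[row[i * k:(i + 1) * k] for row in X] for i in range(num_gpus)]
W_shards = [W[i * k:(i + 1) * k] for i in range(num_gpus)]

# Each "GPU" computes its partial sum independently...
partials = [matmul(x, w) for x, w in zip(X_shards, W_shards)]

# ...then an AllReduce sums the partials, so every GPU ends up
# holding the full layer output.
Y = [[sum(p[r][c] for p in partials) for c in range(len(partials[0][0]))]
     for r in range(len(partials[0]))]

assert Y == matmul(X, W)    # matches unsplit, single-GPU execution
```

The key cost to notice is the AllReduce: it sits on the critical path of every layer, which is why TP's communication overhead grows with GPU count.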
Pipeline parallelism (PP) splits groups of model layers – or stages – across the available GPUs. A request begins on one GPU and continues execution through subsequent stages on subsequent GPUs. With PP, communication occurs only between adjacent stages, rather than between all GPUs as with TP execution.
GPU-to-GPU Bandwidth with and without NVSwitch
Figure: GPU-to-GPU bandwidth with and without NVSwitch. Top: eight GPUs connected through a centralized NVLink Switch. Bottom: eight GPUs connected point-to-point, each with direct links to every other GPU.
NVLink Switch Helps Maximize High-Throughput Performance
Each NVIDIA Hopper architecture GPU incorporates 18 NVLinks, each providing 50 GB/s of bidirectional bandwidth, for a total of 900 GB/s of NVLink bandwidth per GPU. Each HGX H100 8-GPU or HGX H200 server features four NVLink Switches. Without a switch, TP model execution across eight GPUs would require each GPU to communicate with its seven peers over seven equal-bandwidth direct connections, so any single connection would run at just 1/7th of total NVLink bandwidth, or about 128 GB/s. NVLink Switch removes this limit: each GPU connects all of its links to the switch fabric, allowing any pair of GPUs to communicate at the full 900 GB/s.
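The bandwidth arithmetic above is simple enough to check directly (units are GB/s; the variable names are mine):

```python
# Per-GPU NVLink bandwidth on Hopper: 18 links x 50 GB/s bidirectional.
links_per_gpu = 18
bw_per_link = 50                      # GB/s, bidirectional, per NVLink
total_bw = links_per_gpu * bw_per_link
assert total_bw == 900                # 900 GB/s per GPU

# Without a switch, 8 GPUs connect point-to-point: each GPU divides its
# links among its 7 peers, so any single pair gets ~1/7th of the total.
peers = 8 - 1
per_pair_bw = total_bw // peers
assert per_pair_bw == 128             # about 128 GB/s per connection
```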
Choosing Parallelism
Choosing parallelism is about finding the right balance between compute and capacity for the target scenario. NVLink Switch provides developers with the flexibility to select the optimal parallelism configuration leading to better performance than what is possible with either a single GPU, or across multiple GPUs with tensor parallelism alone.
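The balancing act above can be summarized as a toy decision sketch. This heuristic is my own simplification of the article's guidance, not an NVIDIA recipe, and real deployments weigh many more factors (batch size, sequence length, model size, interconnect):

```python
# Toy heuristic: once a model no longer fits on one GPU, pick TP when
# optimizing latency and PP when optimizing throughput.
def choose_parallelism(model_fits_on_one_gpu, optimize_for):
    if model_fits_on_one_gpu:
        return "single GPU"
    if optimize_for == "latency":
        return "tensor parallel"    # splits every layer, cuts per-token time
    return "pipeline parallel"      # low-overhead way to scale throughput

assert choose_parallelism(True, "latency") == "single GPU"
assert choose_parallelism(False, "latency") == "tensor parallel"
assert choose_parallelism(False, "throughput") == "pipeline parallel"
```

In practice the two are often combined, with TP within an NVLink-connected node and PP across nodes, which is why the flexibility NVLink Switch provides matters.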
Conclusion
The NVIDIA platform provides developers with a full technology stack to optimize generative AI inference performance. NVIDIA Hopper architecture GPUs – available from every major cloud and server maker – connected by the high-bandwidth NVLink and NVLink Switch AI fabric, and running TensorRT-LLM software, provide outstanding performance for the latest LLMs.
Frequently Asked Questions
Q: What is tensor parallelism?
A: Tensor parallelism is a technique that splits the execution of each model layer across multiple GPUs, distributing calculations and broadcasting results.
Q: What is pipeline parallelism?
A: Pipeline parallelism is a technique that operates by splitting groups of model layers – or stages – across available GPUs, with communication only occurring between adjacent stages.
Q: What is NVLink Switch?
A: NVLink Switch is a switch chip that connects the NVLink ports of all GPUs in a server, enabling any pair of GPUs to communicate at the full 900 GB/s of per-GPU NVLink bandwidth.
Q: Why is NVLink Switch important?
A: NVLink Switch provides developers with the flexibility to select the optimal parallelism configuration, leading to better performance than what is possible with either a single GPU, or across multiple GPUs with tensor parallelism alone.