NVIDIA Contributes NVIDIA GB200 NVL72 Designs to Open Compute Project

NVIDIA Open-Source Initiatives

NVIDIA has a rich history of open-source initiatives. NVIDIA engineers have released over 900 software projects on GitHub and have open-sourced essential components of the AI software stack. The NVIDIA Triton Inference Server, for example, is now integrated into the platforms of all major cloud service providers for serving AI models in production. Additionally, NVIDIA engineers are actively involved in numerous open-source foundations and standards bodies, including the Linux Foundation, the Python Software Foundation, and the PyTorch Foundation.

Meeting Data Center Compute Demands

The compute required to train autoregressive transformer models has exploded, growing by a staggering 20,000x over the last five years. Meta’s Llama 3.1 405B model, launched earlier this year, required 38 billion petaflops of accelerated compute to train, 50x more than the Llama 2 70B model launched only a year earlier. Training and serving models of this size cannot be handled by a single GPU; the work must be parallelized across massive GPU clusters.
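
For a rough sense of where a figure like 38 billion petaflops comes from, the sketch below applies the common back-of-the-envelope estimate of about 6 floating-point operations per parameter per training token. The parameter count is Llama 3.1 405B’s published size and the roughly 15 trillion training tokens is Meta’s reported figure, so treat the result as an approximation rather than an official accounting.

```python
# Back-of-the-envelope training-compute estimate: ~6 FLOPs per parameter
# per training token. Figures below are approximate public numbers, not
# an official accounting of Llama 3.1 405B's training run.
params = 405e9   # Llama 3.1 405B parameter count
tokens = 15e12   # ~15 trillion training tokens (Meta's reported figure)

total_flops = 6 * params * tokens   # total floating-point operations
petaflops = total_flops / 1e15      # 1 petaflop = 1e15 FLOPs

print(f"~{petaflops:.1e} petaflops")  # on the order of 4e10, i.e. tens of billions
```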

Importance of Multi-GPU Interconnect

One common challenge that arises with model parallelism is the high volume of GPU-to-GPU communication. Tensor-parallel communication patterns highlight just how interconnected these GPUs are. With AllReduce, for example, every GPU must exchange its partial results with every other GPU at every layer of the neural network before the final model output can be determined. Any latency in these communications leads to significant inefficiencies, with GPUs sitting idle while they wait for the collective to complete. This reduces overall system efficiency and increases the total cost of ownership (TCO).
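
The snippet below is a minimal sketch of that pattern using PyTorch’s torch.distributed API with the NCCL backend; the layer shapes and the tensor_parallel_layer helper are illustrative placeholders, not part of any NVIDIA design. It shows why the collective sits on the critical path: no rank can start the next layer until the all-reduce finishes.

```python
# Minimal tensor-parallel sketch: each rank computes a partial layer output,
# then an AllReduce sums the partials across all GPUs before the next layer.
import torch
import torch.distributed as dist

def tensor_parallel_layer(x, local_weight):
    # local_weight stands in for this GPU's shard of the layer's weights,
    # so the matmul produces only a partial result.
    partial = x @ local_weight
    # Every GPU contributes its partial result and receives the summed output.
    # Until this collective completes, no rank can begin the next layer,
    # so any added latency here leaves GPUs idle.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial

if __name__ == "__main__":
    # Launched with one process per GPU (e.g. via torchrun); NCCL routes
    # the GPU-to-GPU traffic over NVLink where it is available.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())
    x = torch.randn(8, 1024, device="cuda")
    w = torch.randn(1024, 4096, device="cuda")
    y = tensor_parallel_layer(x, w)
```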

NVLink Cartridges

To enable high-speed communication between all 72 NVIDIA Blackwell GPUs in the NVLink domain, we implemented a novel design featuring four NVLink cartridges mounted vertically at the rear of the rack. These cartridges accommodate over 5,000 active copper cables, delivering 130 TB/s of aggregate All-to-All bandwidth and 260 TB/s of AllReduce bandwidth.
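
As a quick sanity check (a back-of-the-envelope calculation, not part of the contributed specification), the aggregate All-to-All figure follows from the 1.8 TB/s of NVLink bandwidth each Blackwell GPU provides:

```python
# Back-of-the-envelope check of the aggregate All-to-All bandwidth:
# 72 GPUs x 1.8 TB/s of NVLink bandwidth per GPU ~= 130 TB/s.
gpus = 72
per_gpu_tb_s = 1.8   # NVLink bandwidth per Blackwell GPU, TB/s

aggregate_tb_s = gpus * per_gpu_tb_s
print(f"Aggregate All-to-All bandwidth: ~{aggregate_tb_s:.0f} TB/s")  # ~130 TB/s
```

The 260 TB/s AllReduce figure is exactly twice that aggregate, which is consistent with reductions being performed in the NVLink switch fabric rather than on the GPUs; the design contribution itself is the authoritative source for that number.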

Liquid Cooling Manifolds and Floating Blind Mates

To efficiently manage the 120 kW of cooling capacity required for the rack, we’ve implemented direct liquid cooling techniques. Building upon existing designs, we’ve introduced two key innovations. First, we developed an enhanced Blind Mate Liquid Cooling Manifold design capable of delivering cooling efficiently at this capacity. Second, we created a novel Floating Blind Mate Tray connection, which distributes coolant to both compute and switch trays and significantly improves the ability of the liquid quick disconnects to align and reliably mate in the rack.

Compute and Switch Tray Mechanical Form Factors

To accommodate the high compute density of the rack, we introduced 1RU liquid-cooled compute and switch tray form factors. We also developed a new, denser DC-SCM (Data Center Secure Control Module) design that’s 10% smaller than the current standard. In addition, we implemented a narrower bus bar connector to maximize available rear panel space. These modifications optimize space utilization while maintaining performance.

New Joint NVIDIA GB200 NVL72 Reference Architecture

At OCP, NVIDIA also announced a new joint GB200 NVL72 reference architecture with Vertiv, a leader in power and cooling technologies and an expert in designing, building, and servicing high compute density data centers. This new reference architecture will significantly reduce implementation time for CSPs and data centers deploying the NVIDIA Blackwell platform.

Conclusion

The NVIDIA GB200 NVL72 design represents a significant milestone in the evolution of modern high compute density data centers. By addressing the pressing challenges of training and serving ever-larger AI models and the heavy GPU-to-GPU communication they require, this contribution accelerates the adoption of energy-efficient, high compute density platforms in the data center while reinforcing the importance of collaboration within the open ecosystem. We’re excited to see how the OCP community will leverage and build on top of the GB200 NVL72 design contributions.

FAQs

Q: What is the NVLink domain size and speed in the GB200 NVL72 design?
A: The NVLink domain can now support up to 72 NVIDIA Blackwell GPUs, with a communication speed of 1.8 TB/s per GPU, 36x faster than state-of-the-art 400 Gbps Ethernet standards.

Q: What is the aggregate All-to-All bandwidth of the NVLink domain in the GB200 NVL72 design?
A: The aggregate All-to-All bandwidth is 130 TB/s, alongside 260 TB/s of AllReduce bandwidth.

Q: What is the thermal management solution used in the GB200 NVL72 design?
A: The GB200 NVL72 design uses direct liquid cooling techniques, including an enhanced Blind Mate Liquid Cooling Manifold design and a novel Floating Blind Mate Tray connection.

Q: What is the compute and switch tray mechanical form factor in the GB200 NVL72 design?
A: The GB200 NVL72 design features 1RU liquid-cooled compute and switch tray form factors.

Q: Who is the partner that NVIDIA collaborated with to create the new joint GB200 NVL72 reference architecture?
A: NVIDIA collaborated with Vertiv, a leader in power and cooling technologies and an expert in designing, building, and servicing high compute density data centers.
