NVIDIA Dynamo: Low-Latency Distributed Inference Framework

Accelerating AI Inference in Multinode Deployments

AI inference will help developers create groundbreaking new applications by integrating reasoning models into their workflows, allowing apps to understand and interact with users in more intuitive ways. However, it also represents a significant recurring cost, posing considerable challenges for teams looking to scale their models cost-efficiently to meet the insatiable demand for AI.

Introducing NVIDIA Dynamo

NVIDIA announced the release of NVIDIA Dynamo at GTC 2025. NVIDIA Dynamo is a high-throughput, low-latency, open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. The framework boosts the number of requests served by up to 30x when running the open-source DeepSeek-R1 model on NVIDIA Blackwell. NVIDIA Dynamo is compatible with open-source tools including PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, joining the expanding community of inference tools that empower developers and AI researchers to accelerate AI.

Key Innovations in NVIDIA Dynamo

NVIDIA Dynamo introduces several key innovations, including:

  • Disaggregated prefill and decode inference stages to increase throughput per GPU
  • Dynamic scheduling of GPUs based on fluctuating demand to optimize performance
  • LLM-aware request routing to avoid KV cache recomputation costs (sketched after this list)
  • Accelerated asynchronous data transfer between GPUs to reduce inference response time
  • KV cache offloading across different memory hierarchies to increase system throughput
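
To make the routing idea above concrete, here is a toy sketch of KV-cache-aware request routing: pick the worker that already holds the KV cache for the longest matching prompt prefix, and fall back to the least-loaded worker on ties. This is an illustrative simplification under assumed data structures (block size, worker names), not NVIDIA Dynamo's actual router.

    # Illustrative sketch only -- not NVIDIA Dynamo's actual router.
    # Idea behind KV-cache-aware routing: send a request to the worker that
    # already holds the KV cache for the longest matching prompt prefix, so
    # that prefix does not have to be recomputed; break ties by load.
    from dataclasses import dataclass, field

    BLOCK = 64  # tokens per cached block (hypothetical granularity)

    @dataclass
    class Worker:
        name: str                    # hypothetical worker identifier
        active_requests: int = 0     # crude load signal
        cached: set = field(default_factory=set)  # hashes of cached token prefixes

    def block_hashes(tokens):
        """Hashes of the prompt's leading blocks: tokens[:64], tokens[:128], ..."""
        n = len(tokens) // BLOCK
        return [hash(tuple(tokens[:(b + 1) * BLOCK])) for b in range(n)]

    def cached_blocks(tokens, worker):
        """How many leading blocks of this prompt the worker already has cached."""
        count = 0
        for h in block_hashes(tokens):
            if h not in worker.cached:
                break
            count += 1
        return count

    def route(tokens, workers):
        """Prefer the longest cached prefix; among ties, the least-loaded worker."""
        best = max(workers, key=lambda w: (cached_blocks(tokens, w), -w.active_requests))
        best.active_requests += 1
        best.cached.update(block_hashes(tokens))  # worker now holds these prefixes
        return best

    if __name__ == "__main__":
        workers = [Worker("gpu-node-0"), Worker("gpu-node-1")]
        system_prompt = list(range(256))            # stand-in for a tokenized system prompt
        a = route(system_prompt + [1, 2, 3], workers)
        b = route(system_prompt + [4, 5, 6], workers)
        # Both requests land on the same worker: the second one reuses the
        # system-prompt KV cache instead of recomputing it on an idle worker.
        print(a.name, b.name)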

Get Started with NVIDIA Dynamo

Modern LLMs have grown in parameter count, incorporated reasoning capabilities, and are increasingly embedded in agentic AI workflows. As a result, they generate far more tokens during inference and must be deployed in distributed environments, which drives up costs. Optimizing inference-serving strategies to lower costs and support seamless scaling in distributed environments is therefore crucial.

Developers deploying new generative AI models can start today with the ai-dynamo/dynamo GitHub repo. AI inference developers and researchers are invited to contribute to NVIDIA Dynamo on GitHub and to join the NVIDIA Dynamo Discord server, the official NVIDIA community for developers and users of the framework.
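
Once a deployment from the repo's examples is up and running, a quick way to sanity-check it is to send a request to the serving frontend. The sketch below assumes the deployment exposes an OpenAI-compatible chat-completions endpoint on localhost port 8000 and that a model such as DeepSeek-R1 is registered; the URL and model name are assumptions to replace with whatever your deployment actually serves.

    # Minimal smoke test for a running NVIDIA Dynamo deployment.
    # Assumptions (adjust to your setup): the frontend exposes an
    # OpenAI-compatible HTTP API at http://localhost:8000 and serves a model
    # registered under MODEL_NAME; both values are placeholders.
    import requests

    BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
    MODEL_NAME = "deepseek-ai/DeepSeek-R1"                   # assumed model id

    payload = {
        "model": MODEL_NAME,
        "messages": [
            {"role": "user",
             "content": "Explain disaggregated prefill and decode in one sentence."},
        ],
        "max_tokens": 128,
        "stream": False,
    }

    response = requests.post(BASE_URL, json=payload, timeout=120)
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])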

Conclusion

NVIDIA Dynamo is a groundbreaking open-source inference serving framework that enables developers to accelerate AI inference in multinode deployments. With its innovative features and modular architecture, NVIDIA Dynamo is poised to revolutionize the way AI models are deployed and scaled in large-scale distributed environments.

FAQs

Q: What is NVIDIA Dynamo?
A: NVIDIA Dynamo is a high-throughput, low-latency open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments.

Q: What are the key innovations in NVIDIA Dynamo?
A: NVIDIA Dynamo introduces several key innovations, including disaggregated prefill and decode inference stages, dynamic scheduling of GPUs, LLM-aware request routing, accelerated asynchronous data transfer, and KV cache offloading.

Q: How can I get started with NVIDIA Dynamo?
A: Developers can start today with the ai-dynamo/dynamo GitHub repo. AI inference developers and researchers are invited to contribute to NVIDIA Dynamo on GitHub and to join the NVIDIA Dynamo Discord server, the official NVIDIA community for developers and users of the framework.

Q: How does NVIDIA Dynamo compare to NVIDIA Triton?
A: NVIDIA Dynamo is the successor to NVIDIA Triton, building on its success and offering a new modular architecture designed to serve generative AI models in multinode distributed environments.
