Nvidia Touts Lower ‘Time-to-First-Train’ with DGX Cloud on AWS

Nvidia’s DGX Cloud on AWS: Simplifying Generative AI Development and Deployment

Customers have a lot of options when it comes to building their generative AI stacks to train, fine-tune, and run AI models. In some cases, the number of options may be overwhelming. To help simplify the decision-making and reduce that all-important time it takes to train your first model, Nvidia offers DGX Cloud, which arrived on AWS last week.

The Power of DGX Systems

Nvidia’s DGX systems are considered the gold standard for GenAI workloads, including training large language models (LLMs), fine-tuning them, and running inference workloads in production. The DGX systems are equipped with the latest GPUs, including Nvidia H100 and H200 GPUs, as well as the company’s enterprise AI stack, such as Nvidia Inference Microservices (NIMs), Riva, NeMo, and the RAPIDS frameworks, among other tools.

DGX Cloud: A Managed Service for GenAI Development and Deployment

With its DGX Cloud offering, Nvidia is giving customers the array of GenAI development and production capabilities that come with DGX systems, but delivered via the cloud. It previously offered DGX Cloud on Microsoft Azure, Google Cloud, and Oracle Cloud Infrastructure, and last week at re:Invent 2024, it announced the availability of DGX Cloud on AWS.

Expertise and Support

Nvidia has a significant amount of experience building these AI pipelines on a variety of different types of infrastructure, and it shares that experience with customers through its DGX Cloud service. That allows it to cut down on the complexity the customer is exposed to, thereby accelerating the GenAI development and deployment lifecycle, said Alexis Bjorlin, the vice president of DGX Cloud at Nvidia.

Outcome-Based Capabilities

With DGX Cloud, Nvidia can also provide expert assistance with some of the finer aspects of model development, such as optimizing training routines, Bjorlin said. Sometimes customers want more efficient training, so they want to move from FP16 or BF16 to FP8. Sometimes the question is the quantization of the data. And sometimes it is how to take a model, train it, and shard it across the infrastructure using four types of parallelism: data parallel, pipeline parallel, model parallel, or expert parallel.
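
To make the precision and parallelism trade-offs above more concrete, here is a rough back-of-the-envelope sketch in Python. It is not Nvidia's tooling or method. The function names, the 70-billion-parameter model, and the 64-GPU cluster are all hypothetical, chosen only for illustration. It counts weight storage only (no activations, gradients, or optimizer state), and real FP8 training recipes typically keep some state in higher precision, so treat the numbers as a lower bound.

```python
# Bytes needed to store one value in each numeric format.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Storage for the model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_VALUE[fmt] / 1e9


def data_parallel_degree(total_gpus: int, model_parallel: int, pipeline_parallel: int) -> int:
    """Number of full model replicas left once each replica is sharded.

    Assumes the simple layout total_gpus = data * model * pipeline.
    """
    gpus_per_replica = model_parallel * pipeline_parallel
    if total_gpus % gpus_per_replica != 0:
        raise ValueError("total_gpus must be a multiple of model_parallel * pipeline_parallel")
    return total_gpus // gpus_per_replica


if __name__ == "__main__":
    params = 70_000_000_000  # hypothetical 70B-parameter model
    for fmt in ("bf16", "fp8"):
        print(f"{fmt}: {weight_memory_gb(params, fmt):.0f} GB of weights")
    # Hypothetical cluster: 64 GPUs, 8-way model parallel, 2-way pipeline parallel.
    print("data-parallel replicas:", data_parallel_degree(64, 8, 2))
```

Under these assumptions, halving the bytes per value halves the weight footprint, and the model-parallel and pipeline-parallel degrees determine how many GPUs remain for data-parallel replicas. Expert parallelism is left out of the sketch because how it maps onto the other dimensions depends on the framework.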

H100 GPUs on EC2 P5 Instances

With DGX Cloud running on AWS, Nvidia is supporting H100 GPUs running on EC2 P5 instances (in the future, DGX Cloud will also be supported on the new P6 instances that AWS announced at the conference). That will give customers of all sizes the processing oomph to train, fine-tune, and run some of the largest LLMs.

Flexibility and Scalability

DGX Cloud serves several types of customers. A few very large companies use it to train foundation models, and a larger number of smaller firms use it to fine-tune pre-trained models on their own data, Bjorlin said. Nvidia needs to maintain the flexibility to accommodate all of them.

Conclusion

Nvidia’s DGX Cloud on AWS aims to simplify generative AI development and deployment by delivering the capabilities of DGX systems as a managed service. With Nvidia’s expertise and support, customers can accelerate their GenAI development and deployment lifecycle. The flexibility and scalability of DGX Cloud make it an attractive option for customers of all sizes, from large companies training foundation models to smaller firms fine-tuning pre-trained ones.

Frequently Asked Questions

Q: What is DGX Cloud?
A: DGX Cloud is a managed service offered by Nvidia that provides the array of GenAI development and production capabilities that come with DGX systems, but delivered via the cloud.

Q: What are the benefits of using DGX Cloud?
A: The benefits include a faster GenAI development and deployment lifecycle and expert assistance with model development, such as optimizing training routines.

Q: What types of customers use DGX Cloud?
A: DGX Cloud serves several types of customers, including a few very large companies training foundation models and a larger number of smaller firms fine-tuning pre-trained models on their own data.

Q: How flexible and scalable is DGX Cloud?
A: DGX Cloud is designed to accommodate customers of all sizes, from those training foundation models to those fine-tuning pre-trained ones, so they can meet their specific needs and scale up or down as required.
