G6e Instances Powered by NVIDIA’s L40S Tensor Core GPUs Now Available on Amazon SageMaker
Key Highlights
- Twice the GPU memory compared to G5 and G6 instances, enabling deployment of large language models in FP16: up to a 14B-parameter model on a single-GPU node (G6e.xlarge), a 72B-parameter model on a 4-GPU node (G6e.12xlarge), and a 90B-parameter model on an 8-GPU node (G6e.48xlarge); see the memory sketch after this list
- Up to 400 Gbps of networking throughput
- Up to 384 GB of GPU memory
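As a rough sanity check on those model-size limits, FP16 weights take 2 bytes per parameter, so a 14B-parameter model needs about 28 GB for weights alone before KV cache and activations. The back-of-the-envelope sketch below illustrates the arithmetic; the 48 GB figure is the memory of a single L40S GPU, and the 20% overhead factor is an assumption for illustration, not a measured value.

# Back-of-the-envelope FP16 memory check (illustrative assumptions only).
GPU_MEMORY_GB = 48  # memory of a single NVIDIA L40S GPU

def fits_in_fp16(params_billions, num_gpus, overhead=0.20):
    weights_gb = params_billions * 2           # 2 bytes per parameter
    required_gb = weights_gb * (1 + overhead)  # rough headroom for KV cache and activations
    return required_gb <= GPU_MEMORY_GB * num_gpus

for label, params, gpus in [("G6e.xlarge", 14, 1), ("G6e.12xlarge", 72, 4), ("G6e.48xlarge", 90, 8)]:
    print(label, "-", params, "B params:", "fits" if fits_in_fp16(params, gpus) else "does not fit")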
Use Cases
G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e instances provide higher performance and are more cost-effective than G5 instances, making them an ideal fit for low-latency, real-time use cases such as:
- Chatbots and conversational AI
- Text generation and summarization
- Image generation and vision models
Performance
In the following two figures, we see that for context lengths of 512 and 1024 tokens, G6e.2xlarge provides up to 37% lower latency and 60% higher throughput than G5.2xlarge for the Llama 3.1 8B model.
[Figure 1]
[Figure 2]
In the following two figures, we see that G5.2xlarge throws a CUDA out-of-memory (OOM) error when deploying the Llama 3.2 11B Vision model, whereas G6e.2xlarge serves the same model with strong performance.
[Figure 3]
[Figure 4]
In the following two figures, we compare the G5.48xlarge (8-GPU) node with the G6e.12xlarge (4-GPU) node, which costs 35% less and is more performant. At higher concurrency, G6e.12xlarge provides 60% lower latency and 2.5 times higher throughput.
[Figure 5]
[Figure 6]
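The exact benchmarking harness behind these figures isn't reproduced here; the sketch below shows one way to measure end-to-end latency and throughput against a deployed SageMaker endpoint. The endpoint name, payload, request count, and concurrency level are illustrative assumptions rather than the configuration used for the results above.

import json
import time
from concurrent.futures import ThreadPoolExecutor
import boto3

ENDPOINT_NAME = "llama-3-1-8b-g6e"   # hypothetical endpoint name
CONCURRENCY = 16                     # assumed client-side concurrency
NUM_REQUESTS = 100
PAYLOAD = {"inputs": "Summarize the benefits of GPU inference.",
           "parameters": {"max_new_tokens": 128}}

runtime = boto3.client("sagemaker-runtime")

def one_request(_):
    # Send a single request and return its end-to-end latency in seconds.
    start = time.perf_counter()
    runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                            ContentType="application/json",
                            Body=json.dumps(PAYLOAD))
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(NUM_REQUESTS)))
elapsed = time.perf_counter() - start

print(f"p50 latency: {latencies[len(latencies) // 2]:.2f} s")
print(f"throughput:  {NUM_REQUESTS / elapsed:.2f} requests/s")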
Deployment Walkthrough
Prerequisites
To try out this solution using SageMaker, you’ll need the following prerequisites:
- A SageMaker account
- A GPU-enabled instance (G6e.xlarge, G6e.12xlarge, or G6e.48xlarge)
- The necessary software and dependencies installed on your machine
Deployment
You can clone the repository and use the notebook provided here; a minimal deployment sketch follows.
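For reference, here is a minimal sketch of what a G6e deployment can look like with the SageMaker Python SDK and the Hugging Face LLM (TGI) container. The notebook in the repository is the authoritative version; the model ID, serving environment settings, and instance type below are assumptions chosen for illustration, and gated models additionally need a HUGGING_FACE_HUB_TOKEN entry in env.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hugging Face LLM (TGI) inference container.
image_uri = get_huggingface_llm_image_uri("huggingface")

# Model ID and serving settings are illustrative assumptions.
model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
        "SM_NUM_GPUS": "1",            # one L40S GPU on ml.g6e.2xlarge
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

# Deploy a real-time endpoint on a G6e instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    container_startup_health_check_timeout=900,
)

# Quick check that the endpoint is serving.
print(predictor.predict({"inputs": "What are NVIDIA L40S GPUs good at?"}))

The same pattern extends to the larger G6e sizes by raising SM_NUM_GPUS and switching instance_type (for example, ml.g6e.12xlarge for a 4-GPU node).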
Clean up
To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model and endpoint with the following code:
predictor.delete_model()
predictor.delete_endpoint()
Conclusion
G6e instances on SageMaker unlock the ability to deploy a wide variety of open source models cost-effectively. With superior memory capacity, enhanced performance, and cost-effectiveness, these instances represent a compelling solution for organizations looking to deploy and scale their AI applications. The ability to handle larger models, support longer context lengths, and maintain high throughput makes G6e instances particularly valuable for modern AI applications. Try the code to deploy with G6e.
About the Authors
- Vivek Gangasani: Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
- Alan Tan: Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.
- Pavan Kumar Madduri: Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions in Generative AI and is passionate about helping customers harness the power of the cloud. He earned his MS in Information Technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.
- Michael Nguyen: Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.
FAQs
Q: What are the key highlights of G6e instances?
A: The key highlights of G6e instances include twice the GPU memory compared to G5 and G6 instances, up to 400 Gbps of networking throughput, and up to 384 GB GPU memory.
Q: What are the use cases for G6e instances?
A: G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e instances provide higher performance and are more cost-effective than G5 instances, making them an ideal fit for low-latency, real-time use cases such as chatbots and conversational AI, text generation and summarization, and image generation and vision models.
Q: How do G6e instances perform compared to G5 instances?
A: Our benchmarks show that G6e instances provide higher performance and are more cost-effective than G5 instances. For context lengths of 512 and 1024 tokens, G6e.2xlarge provides up to 37% lower latency and 60% higher throughput than G5.2xlarge for the Llama 3.1 8B model.

