Cost-Effective User Throughput
Businesses are often challenged with balancing the performance and costs of inference workloads. While some customers or use cases may work with an out-of-the-box or hosted model, others may require customization. NVIDIA technologies simplify model deployment while optimizing cost and performance for AI inference workloads. In addition, customers gain flexibility and customizability in the models they choose to deploy.
- NVIDIA NIM inference microservices are prepackaged and performance-optimized for rapidly deploying AI foundation models on any infrastructure — cloud, data centers, edge or workstations.
- NVIDIA Triton Inference Server, one of the company’s most popular open-source projects, allows users to package and serve any model regardless of the AI framework it was trained on.
- NVIDIA TensorRT is a high-performance deep learning inference library that includes runtime and model optimizations to deliver low-latency and high-throughput inference for production applications.
Available in all major cloud marketplaces, the NVIDIA AI Enterprise software platform includes all these solutions and provides enterprise-grade support, stability, manageability and security.
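For a sense of what framework-agnostic serving looks like in practice, below is a minimal client-side sketch that sends an inference request to a model hosted by Triton Inference Server using the open-source tritonclient Python package. The model name and the input and output tensor names and shapes are placeholders; the real values come from the model’s own configuration in the Triton model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton Inference Server (default HTTP port).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder tensor names and shapes; real values come from the model's config.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Triton serves the model the same way regardless of the framework it was trained in.
result = client.infer(
    model_name="resnet50",  # hypothetical model in the server's repository
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
)
print(result.as_numpy("OUTPUT__0").shape)
```

The same client code applies whether the underlying model runs on the TensorRT, PyTorch, ONNX Runtime or another Triton backend; only the server-side model repository changes.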
Cloud-Based LLM Inference
To ease LLM deployment, NVIDIA has collaborated closely with every major cloud service provider to ensure that the NVIDIA inference platform can be seamlessly deployed in the cloud with minimal or no code required.
- Amazon SageMaker AI, Amazon Bedrock Marketplace, Amazon Elastic Kubernetes Service
- Google Cloud’s Vertex AI, Google Kubernetes Engine
- Microsoft Azure AI Foundry (coming soon), Azure Kubernetes Service
- Oracle Cloud Infrastructure’s data science tools, Oracle Cloud Infrastructure Kubernetes Engine
Plus, for customized inference deployments, NVIDIA Triton Inference Server is deeply integrated with all major cloud service providers.
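As one illustration of how little custom code such a managed deployment can require, here is a sketch that uses the SageMaker Python SDK to stand up a real-time endpoint from a prebuilt Triton container image. The container URI, IAM role, S3 path and instance type are placeholders to be adapted to a specific account and region.

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

# All values below are placeholders for a real account/region setup.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
triton_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:latest"
model_artifacts = "s3://my-bucket/triton-model-repository/model.tar.gz"

model = Model(
    image_uri=triton_image,
    model_data=model_artifacts,
    role=role,
    sagemaker_session=session,
)

# Provision a GPU-backed real-time endpoint that serves the Triton model repository.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
```

Comparable managed paths exist on the other clouds listed above, whether through Vertex AI endpoints or the Kubernetes services from Google, Microsoft and Oracle.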
Transforming Agreement Management With Docusign
Docusign, a leader in digital agreement management, turned to NVIDIA to supercharge its Intelligent Agreement Management platform. With over 1.5 million customers globally, Docusign needed to optimize throughput and manage infrastructure expenses while delivering AI-driven insights.
NVIDIA Triton provided a unified inference platform for all frameworks, accelerating time to market and boosting productivity by transforming agreement data into actionable insights. Docusign’s adoption of the NVIDIA inference platform underscores the positive impact of scalable AI infrastructure on customer experiences and ROI, and it highlights how robust AI infrastructure can transform agreement management.
Serving 435 Million Search Queries Monthly With Perplexity AI
Perplexity AI, an AI-powered search engine, handles over 435 million monthly queries. Each query represents multiple AI inference requests. To meet this demand, the Perplexity AI team turned to NVIDIA H100 GPUs, Triton Inference Server and TensorRT-LLM.
Supporting over 20 AI models, including Llama 3 variants in 8B and 70B sizes, Perplexity processes diverse tasks such as search, summarization and question-answering. By using smaller classifier models to route tasks to GPU pods managed by NVIDIA Triton, the company delivers cost-efficient, responsive service while meeting strict service-level agreements.
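The routing pattern described above can be sketched in a few lines of Python. The model pool names, the classifier heuristic and the dispatch function below are hypothetical stand-ins, not Perplexity’s actual implementation; in production the classifier would itself be a small model and the dispatch step would call a Triton-served endpoint.

```python
# Hypothetical sketch of classifier-based routing: a lightweight check decides
# which GPU pod (model pool) should handle a request, so larger models are only
# used when the task actually needs them.
ROUTES = {
    "lookup": "llama3-8b-pool",      # short, latency-sensitive queries
    "synthesis": "llama3-70b-pool",  # longer summarization / reasoning tasks
}

def classify(query: str) -> str:
    """Stand-in for a small classifier model that labels the task type."""
    return "synthesis" if len(query.split()) > 30 else "lookup"

def dispatch(query: str) -> str:
    """Pick the pool and (in a real system) forward the request to Triton."""
    pool = ROUTES[classify(query)]
    return f"routed to {pool}"

print(dispatch("What is the capital of France?"))
```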
Elevating Creative Workflows With Let’s Enhance
To optimize its workflows using the Stable Diffusion XL model in production, Let’s Enhance, a pioneering AI startup, chose the NVIDIA AI inference platform.
Let’s Enhance’s latest product, AI Photoshoot, uses the SDXL model to transform plain product photos into beautiful visual assets for e-commerce websites and marketing campaigns.
With NVIDIA Triton’s robust support for various frameworks and backends, coupled with its dynamic batching feature set, Let’s Enhance was able to seamlessly integrate the SDXL model into existing AI pipelines with minimal involvement from engineering teams, freeing up their time for research and development efforts.
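A rough sketch of why dynamic batching keeps client code simple: requests can be issued one image at a time, and the batching happens on the server. The enhance function and file names below are placeholders rather than a real client for a Triton-served SDXL model.

```python
from concurrent.futures import ThreadPoolExecutor

def enhance(image_path: str) -> str:
    """Placeholder for a single-image request to a Triton-served SDXL model."""
    # A real client would send the image to the server here; with dynamic
    # batching enabled, Triton groups concurrent single-image requests into
    # larger batches automatically on the server side.
    return f"enhanced:{image_path}"

image_paths = [f"product_{i}.png" for i in range(32)]  # hypothetical inputs

# The client simply issues concurrent requests; no batching logic is needed here.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(enhance, image_paths))

print(f"{len(results)} images processed")
```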
Unlocking the Full Potential of AI Inference With Hardware Innovation
Improving the efficiency of AI inference workloads is a multifaceted challenge that demands innovative technologies across hardware and software.
NVIDIA GPUs are at the forefront of AI enablement, offering high efficiency and performance for AI models. They’re also the most energy efficient: with the NVIDIA Blackwell architecture, NVIDIA accelerated computing uses 100,000x less energy per generated token than it did a decade ago for inference of trillion-parameter AI models.
The NVIDIA Grace Hopper Superchip, which combines NVIDIA Grace CPU and Hopper GPU architectures using NVIDIA NVLink-C2C, delivers substantial inference performance improvements across industries.
Meta Andromeda is using the superchip for efficient and high-performing personalized ads retrieval. By building deep neural networks with increased compute complexity and parallelism, it has achieved an 8% improvement in ad quality on select segments and a 6% improvement in recall on Facebook and Instagram.
With optimized retrieval models and low-latency, high-throughput, memory-I/O-aware GPU operators, Andromeda offers a 100x improvement in feature extraction speed compared to previous CPU-based components. This integration of AI at the retrieval stage has allowed Meta to lead the industry in ads retrieval, addressing challenges like scalability and latency for a better user experience and higher return on ad spend.
As cutting-edge AI models continue to grow in size, the amount of compute required to generate each token also grows. To run state-of-the-art LLMs in real time, enterprises need multiple GPUs working in concert. Tools like the NVIDIA Collective Communication Library, or NCCL, enable multi-GPU systems to quickly exchange large amounts of data between GPUs with minimal communication time.
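As a rough illustration of the kind of collective communication NCCL accelerates, the sketch below uses PyTorch’s NCCL backend to all-reduce a tensor across the GPUs of a single node. The tensor contents, sizes and addresses are placeholders, and the example assumes a machine with at least two NVIDIA GPUs.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Each process drives one GPU; NCCL handles the inter-GPU collective.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder rendezvous port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy stand-in for per-GPU partial results (e.g., shards of activations).
    tensor = torch.full((4,), float(rank), device=f"cuda:{rank}")

    # NCCL all-reduce: every GPU ends up with the elementwise sum.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {tensor.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    assert world_size >= 2, "this sketch assumes at least two local GPUs"
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```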
Conclusion
NVIDIA’s advancements in inference software optimization and the NVIDIA Hopper platform are helping industries serve the latest generative AI models, delivering excellent user experiences while optimizing total cost of ownership. The company’s full-stack software optimization approach offers the key to improving AI inference performance and achieving this goal.
Frequently Asked Questions
Q: How does NVIDIA optimize AI inference for cost and performance?
A: NVIDIA optimizes AI inference using its full-stack software optimization approach, which combines world-class silicon, systems, and software for high-throughput and low-latency inference and enables great user experiences while reducing cost.
Q: Which cloud providers collaborate with NVIDIA to ease LLM deployment?
A: NVIDIA has collaborated closely with every major cloud service provider to ensure seamless deployment in the cloud with minimal or no code required, including Amazon, Google Cloud, Microsoft, and Oracle Cloud Infrastructure.
Q: Can NVIDIA Triton Inference Server be used with any AI framework?
A: Yes, NVIDIA Triton Inference Server is a framework-agnostic solution that allows users to package and serve any model, regardless of the AI framework it was trained on.