NVIDIA AI Blueprint for Cost-Efficient LLM Routing

Accelerating Large Language Models with the NVIDIA AI Blueprint for LLM Router

Introduction

Since the release of ChatGPT in November 2022, the capabilities of large language models (LLMs) have surged, and the number of available models has grown exponentially. With this expansion, LLMs now vary widely in cost, performance, and specialization. For AI developers and MLOps teams, the challenge lies in selecting the right model for each prompt—balancing accuracy, performance, and cost.

The Challenge of Model Selection

A one-size-fits-all approach is inefficient, leading to either unnecessary expenses or suboptimal results. To solve this, the NVIDIA AI Blueprint for an LLM router provides an accelerated, cost-optimized framework for multi-LLM routing. It seamlessly integrates NVIDIA tools and workflows to dynamically route prompts to the most suitable LLM, offering a powerful foundation for enterprise-scale LLM operations.

Key Features of the LLM Router

  • Configurable: Easily integrates with foundation models, including NVIDIA NIM microservices and third-party LLMs.
  • High-performance: Built with Rust and powered by NVIDIA Triton Inference Server, ensuring minimal latency compared to direct model queries.
  • OpenAI API-compliant: Acts as a drop-in replacement for existing OpenAI API-based applications.
  • Flexible: Includes default routing behavior and enables fine-tuning based on business needs.
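Because the router is OpenAI API-compliant, an existing client can target it by changing only the endpoint and adding routing metadata to the request body. The sketch below builds such a payload; the local endpoint URL and the exact shape of the `nim-llm-router` field (policy name, routing strategy) are assumptions based on the blueprint's sample payloads, not guaranteed names.

```python
import json

# Hypothetical local router endpoint; a real deployment would expose its own.
ROUTER_URL = "http://localhost:8084/v1/chat/completions"

def build_chat_request(prompt: str, policy: str = "task_router") -> dict:
    """Build an OpenAI-style chat payload with router metadata.

    The extra "nim-llm-router" field is an assumption drawn from the
    blueprint's examples. The model field is left empty so the router,
    not the client, selects the target LLM.
    """
    return {
        "model": "",  # chosen by the router based on classification
        "messages": [{"role": "user", "content": prompt}],
        "nim-llm-router": {
            "policy": policy,
            "routing_strategy": "triton",
        },
    }

payload = build_chat_request("Summarize the attached report in three bullets.")
print(json.dumps(payload, indent=2))
```

From here, the payload would be POSTed to the router exactly as an OpenAI chat completion request, which is what makes the router a drop-in replacement.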

Prerequisites

To deploy the LLM router, ensure your system meets the following requirements:

  • Operating System: Linux (Ubuntu 22.04 or later)
  • Hardware: NVIDIA V100 GPU (or newer) with 4 GB of memory
  • Software:
    • CUDA Toolkit and NVIDIA Container Toolkit
    • Docker and Docker Compose
    • Python
  • API keys:
    • NVIDIA NGC API key
    • NVIDIA API catalog key

Deploying and Managing the LLM Router

  1. Deploy the LLM router
    Follow the blueprint notebook to install the necessary dependencies and run the LLM router services using Docker Compose.
  2. Test the routing behavior
    Make a request to the LLM router using the sample Python code or the sample web application. The LLM router handles the request by acting as a reverse proxy:

    • LLM router receives the request and parses the payload
    • LLM router forwards the parsed payload to a classification model
    • Model returns a classification
    • LLM router forwards the payload to an LLM based on the classification
    • LLM router proxies the results from the LLM back to the user
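The reverse-proxy flow above can be sketched in a few lines. The keyword-based classifier and model names below are illustrative placeholders; in the actual blueprint, classification is performed by a model served with NVIDIA Triton Inference Server.

```python
# Toy sketch of the router's decision step. The keyword heuristic stands in
# for the Triton-served classification model; model names are placeholders.
ROUTE_TABLE = {
    "summarization": "small-efficient-llm",
    "reasoning": "large-reasoning-llm",
    "default": "general-purpose-llm",
}

def classify(prompt: str) -> str:
    """Stand-in classifier: assign a category from simple keywords."""
    text = prompt.lower()
    if "summarize" in text or "summary" in text:
        return "summarization"
    if "prove" in text or "step by step" in text:
        return "reasoning"
    return "default"

def route(prompt: str) -> str:
    """Return the model the router would forward this prompt to."""
    category = classify(prompt)
    return ROUTE_TABLE.get(category, ROUTE_TABLE["default"])

print(route("Summarize the meeting notes"))
```

The real router performs the same lookup, then proxies the request to the chosen LLM and streams the response back to the caller.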

Customizing the Router

Follow the instructions in the blueprint to change the routing policy and LLMs. By default, the blueprint includes examples for routing based on task classification or complexity classification. Fine-tuning a custom classification model is demonstrated in the customization template notebooks.
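Conceptually, changing the routing policy means remapping classification categories to different backend models. The blueprint defines its policies in configuration files; the Python dict below is only an illustrative stand-in, and the category and model names are hypothetical.

```python
import copy

# Illustrative stand-in for a routing policy; the blueprint itself keeps
# this mapping in its configuration, and these names are hypothetical.
DEFAULT_POLICY = {
    "name": "complexity_router",
    "routes": {
        "Reasoning": "large-reasoning-llm",
        "Creativity": "creative-llm",
        "default": "general-purpose-llm",
    },
}

def with_model(policy: dict, category: str, model: str) -> dict:
    """Return a copy of the policy with one category routed to a new model."""
    updated = copy.deepcopy(policy)
    updated["routes"][category] = model
    return updated

# Swap in a fine-tuned model for creative prompts without touching the rest.
custom = with_model(DEFAULT_POLICY, "Creativity", "my-finetuned-llm")
print(custom["routes"]["Creativity"])
```

Keeping the default policy immutable and deriving customized copies makes it easy to A/B test a new model on one category while leaving the other routes unchanged.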

Monitoring Performance

Run a load test by following the instructions in the blueprint’s load test demonstration. The router captures metrics that can be viewed in a Grafana dashboard.

Multiturn Routing Example

One of the key capabilities of the LLM router is the ability to handle multiturn conversations by sending each new query to the best LLM. This ensures that each request is handled optimally while maintaining context across different types of tasks. An example is outlined below:

  1. User Prompt 1: "A farmer needs to transport a wolf, a goat, and a cabbage across a river. The boat can only carry one item at a time. If left alone together, the wolf will eat the goat, and the goat will eat the cabbage. How can the farmer safely transport all three items across the river?"
    Complexity Router → Chosen Classifier: Reasoning
  2. User Prompt 2: "Resolve this problem using graph theory. Define nodes as valid states (for example, FWGC-left) and edges as permissible boat movements. Formalize the solution as a shortest-path algorithm."
    Complexity Router → Chosen Classifier: Domain-Knowledge
  3. User Prompt 3: "Analyze how Step 2 in your solution specifically prevents the wolf-cabbage conflict you mentioned in Step 4. Use your original step numbering to trace dependencies between these actions."
    Complexity Router → Chosen Classifier: Constraint
  4. User Prompt 4: "Based on the above, write a science fiction story."
    Complexity Router → Chosen Classifier: Creativity
  5. User Prompt 5: "Now summarize the above in a short and concise manner."
    Task Router → Chosen Classifier: Summarization
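The multiturn flow above can be sketched as a loop that carries the full conversation history while routing each new user turn independently. As before, the classifier and model names are illustrative placeholders, not the blueprint's actual components.

```python
# Toy multiturn loop: history is preserved across turns, but each new user
# turn is classified and routed on its own. Names are placeholders.
def classify_turn(prompt: str) -> str:
    text = prompt.lower()
    if "summarize" in text:
        return "Summarization"
    if "story" in text:
        return "Creativity"
    return "Reasoning"

MODELS = {
    "Summarization": "small-efficient-llm",
    "Creativity": "creative-llm",
    "Reasoning": "large-reasoning-llm",
}

history = []
chosen = []
for turn in [
    "How can the farmer safely transport all three items?",
    "Based on the above, write a science fiction story.",
    "Now summarize the above in a short and concise manner.",
]:
    history.append({"role": "user", "content": turn})
    model = MODELS[classify_turn(turn)]
    chosen.append(model)
    # A real deployment would POST {"messages": history, ...} to the router;
    # here we record a placeholder assistant reply to keep context.
    history.append({"role": "assistant", "content": f"[{model} reply]"})

print(chosen)
```

Because the whole message history travels with every request, the model chosen for turn N still sees the context produced by the different models that handled earlier turns.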

Conclusion

Implementing the NVIDIA AI Blueprint for an LLM router enables organizations to deliver high-performance, accurate responses tailored to specific user intents while retaining the flexibility of plug-and-play model scaling. It also yields cost savings compared to the baseline approach of routing every request to the most sophisticated model.

Reducing Costs

By matching simple tasks with smaller, more efficient models, you significantly reduce operational costs while maintaining fast response times.

Boosting Performance

More complex queries are routed to the best-fit models, ensuring the highest accuracy and efficiency.

Scaling Seamlessly

Whether you need open-source models, closed-source models, or a mix of both, the blueprint provides the flexibility to scale and adapt to your organization’s needs.

Get Started

Experience this blueprint now through NVIDIA Launchables. View the full source code in the NVIDIA-AI-Blueprints/llm-router GitHub repo. To learn more about router classification models, read about the NVIDIA NeMo Curator Prompt Task and Complexity Classifier.

FAQs

  • Q: What are the system requirements for deploying the LLM router?
    A: The system requires a Linux operating system (Ubuntu 22.04 or later), an NVIDIA V100 GPU (or newer) with 4 GB of memory, the CUDA Toolkit and NVIDIA Container Toolkit, Docker and Docker Compose, Python, an NVIDIA NGC API key, and an NVIDIA API catalog key.
  • Q: How does the LLM router handle multiturn conversations?
    A: The LLM router sends each new query to the best LLM, ensuring that each request is handled optimally while maintaining context across different types of tasks.
  • Q: Can I customize the routing policy and LLMs?
    A: Yes, follow the instructions in the blueprint to change the routing policy and LLMs. By default, the blueprint includes examples for routing based on task classification or complexity classification.
