Benchmarking Customized Models on Amazon Bedrock Using LLMPerf and LiteLLM

Open Foundation Models and the Need for Performance Benchmarking

Prerequisites

This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.

Using Open Source Tools LLMPerf and LiteLLM for Performance Benchmarking

To conduct performance benchmarking, you will use LLMPerf, a popular open-source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is its wide support for foundation model APIs, including LiteLLM, which supports all models available on Amazon Bedrock.

Setting up Your Custom Model Invocation with LiteLLM

LiteLLM is a versatile open-source tool that can be used both as a Python SDK and as a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and Converse, as well as the FMs available on Amazon Bedrock, including imported custom models.
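As a quick illustration, the following sketch shows how you could invoke an imported custom model through the LiteLLM Python SDK. The model ARN is a placeholder for your own Custom Model Import deployment, the code assumes AWS credentials and a Region are already configured in your environment, and the exact bedrock/... model string format may vary with your LiteLLM version.

```python
# Minimal sketch: invoking an imported custom model on Amazon Bedrock via LiteLLM.
# The ARN below is a placeholder; replace it with the ARN of your imported model.
from litellm import completion

MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/your-model-id"  # placeholder

response = completion(
    model=f"bedrock/{MODEL_ARN}",  # the "bedrock/" prefix routes the request to Amazon Bedrock
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    max_tokens=256,
    temperature=0.1,
)

# LiteLLM returns an OpenAI-compatible response object
print(response.choices[0].message.content)
```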

Configuring a Token Benchmark Test with LLMPerf

To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as Ray actors that execute in parallel. LLMPerf’s requests_launcher module manages the distribution of requests across the Ray clients, allowing you to simulate various load scenarios and concurrent request patterns. Each client also collects performance metrics during the requests, including latency, throughput, and error rates.
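The sketch below shows what launching LLMPerf’s token benchmark against the imported model could look like, wrapped in Python to keep the example self-contained. The flag values (token counts, concurrency, number of requests) are illustrative assumptions, the model ARN is a placeholder, and it assumes LLMPerf’s token_benchmark_ray.py script is available in the working directory with AWS credentials set in the environment.

```python
# Sketch: running LLMPerf's token benchmark against a Bedrock custom model via LiteLLM.
import os
import subprocess

os.environ["AWS_REGION"] = "us-east-1"  # Region where the custom model was imported (assumption)

MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/your-model-id"  # placeholder

subprocess.run(
    [
        "python", "token_benchmark_ray.py",
        "--model", f"bedrock/{MODEL_ARN}",
        "--llm-api", "litellm",                 # route invocations through LiteLLM
        "--mean-input-tokens", "512",           # illustrative prompt length
        "--stddev-input-tokens", "25",
        "--mean-output-tokens", "256",          # illustrative completion length
        "--stddev-output-tokens", "25",
        "--max-num-completed-requests", "100",  # total requests to complete before stopping
        "--num-concurrent-requests", "5",       # number of parallel Ray clients
        "--results-dir", "bedrock-benchmark-results",
    ],
    check=True,
)
```

Varying --num-concurrent-requests across runs lets you observe how latency and throughput degrade as load on the model deployment increases.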

Analyzing Performance Results from LLMPerf and Estimating Costs Using Amazon CloudWatch

LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the serving properties or configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end-user experience of your application. Because Custom Model Import charges are based on the number of active model copies, you can combine these latency and throughput measurements with the model copy metrics that Amazon Bedrock publishes to Amazon CloudWatch to estimate the cost of serving your expected traffic.
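After a run completes, LLMPerf writes its aggregated metrics to the results directory. The sketch below shows one way to read them back; the file and metric key names are assumptions that can differ between LLMPerf versions, so it falls back to printing the full summary if the expected keys are not found.

```python
# Sketch: inspecting the summary file LLMPerf writes to the results directory.
import glob
import json
import pprint

EXPECTED_KEYS = [
    "results_ttft_s_mean",                                 # time to first token
    "results_inter_token_latency_s_mean",                  # time per output token
    "results_end_to_end_latency_s_mean",                   # full request latency
    "results_request_output_throughput_token_per_s_mean",  # output tokens per second
    "results_num_completed_requests",
]

for path in glob.glob("bedrock-benchmark-results/*summary.json"):
    with open(path) as f:
        summary = json.load(f)
    print(f"--- {path} ---")
    found = {key: summary[key] for key in EXPECTED_KEYS if key in summary}
    if found:
        for key, value in found.items():
            print(f"{key}: {value}")
    else:
        # Key names differ in this LLMPerf version; print everything instead.
        pprint.pprint(summary)
```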

Conclusion

While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and compare models across key metrics such as cost, latency, and throughput.

Additional Resources

About the Authors

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.

Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.
