Optimizing Large Language Models (LLMs) with a Serverless Read-Through Semantic Cache
Solution Overview
The cache in this solution acts as a buffer, intercepting prompts before they reach the main model. It functions as a memory bank of previously encountered prompts and their responses, and it is designed to quickly match a new prompt with its closest semantic counterparts.
A Serverless Read-Through Semantic Cache Pattern
In this architecture, cache misses and cache hits follow different paths (shown in red and green, respectively, in the architecture diagram). The client sends a query, which is embedded and semantically compared with previously seen queries. When no stored query meets the similarity threshold, the Lambda function, acting as the cache manager, prompts the LLM for a new generation and writes the result back to the cache; otherwise it returns the cached response directly, as the following sketch shows.
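The following is a minimal sketch of what such a cache-manager Lambda function could look like, assuming a Python runtime, an OpenSearch Serverless vector index created with a knn_vector field, and environment variables for the collection endpoint, index name, and similarity threshold. The environment variable names, field names, and threshold value are illustrative assumptions, not the exact values used by the solution's template.

```python
import json
import os
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Assumed configuration; the actual stack may use different names and defaults.
REGION = os.environ.get("AWS_REGION", "us-east-1")
COLLECTION_ENDPOINT = os.environ["COLLECTION_ENDPOINT"]      # OpenSearch Serverless endpoint
INDEX_NAME = os.environ.get("INDEX_NAME", "semantic-cache")  # index with a knn_vector "embedding" field
SIMILARITY_THRESHOLD = float(os.environ.get("SIMILARITY_THRESHOLD", "0.75"))

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")
opensearch = OpenSearch(
    hosts=[{"host": COLLECTION_ENDPOINT.replace("https://", ""), "port": 443}],
    http_auth=auth, use_ssl=True, verify_certs=True,
    connection_class=RequestsHttpConnection,
)

def embed(text: str) -> list:
    """Embed the prompt with Amazon Titan Text Embeddings on Amazon Bedrock."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def generate(prompt: str) -> str:
    """Fall back to Anthropic's Claude 2 on Amazon Bedrock when there is no cache hit."""
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": 512,
        }),
    )
    return json.loads(response["body"].read())["completion"]

def lambda_handler(event, context):
    prompt = event["prompt"]          # assumed event shape
    vector = embed(prompt)

    # Look for the closest previously seen prompt in the vector index.
    hits = opensearch.search(index=INDEX_NAME, body={
        "size": 1,
        "query": {"knn": {"embedding": {"vector": vector, "k": 1}}},
    })["hits"]["hits"]

    # Cache hit: the best match is similar enough, so return the stored response.
    if hits and hits[0]["_score"] >= SIMILARITY_THRESHOLD:
        return {"source": "cache", "response": hits[0]["_source"]["response"]}

    # Cache miss: call the LLM, then write the new prompt/response pair back to the cache.
    response = generate(prompt)
    opensearch.index(index=INDEX_NAME, body={
        "prompt": prompt, "embedding": vector, "response": response,
    })
    return {"source": "llm", "response": response}
```

The read-through behavior lives entirely in the handler: callers only ever talk to the Lambda function, and the function decides whether a stored response is close enough to reuse or whether a fresh generation is needed.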
Prerequisites
Amazon Bedrock users need to request access to FMs before they are available for use. This is a one-time action and takes less than a minute. For this solution, you'll need access to an embedding model on Amazon Bedrock, such as Cohere Embed English or Amazon Titan Text Embeddings.
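Once access is granted, a quick way to confirm the embedding model is reachable from your account is to call it directly. The Region and model ID below are assumptions; use whichever embedding model you enabled.

```python
import json
import boto3

# Sanity check: invoke the embedding model once and inspect the result.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    body=json.dumps({"inputText": "Hello, semantic cache!"}),
)
embedding = json.loads(response["body"].read())["embedding"]
print(f"Embedding length: {len(embedding)}")  # Titan Text Embeddings v1 returns 1,536 dimensions
```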
Deploy the Solution
This solution entails setting up a Lambda layer that includes the dependencies needed to interact with services like OpenSearch Serverless and Amazon Bedrock. A pre-built layer is compiled and published to a public Amazon Simple Storage Service (Amazon S3) prefix, which is referenced in the provided CloudFormation template.
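You can launch the template from the CloudFormation console, or programmatically as sketched below. The stack name and template URL are placeholders; substitute the template location provided with the solution.

```python
import boto3

# One way to launch the stack programmatically; values below are placeholders.
cloudformation = boto3.client("cloudformation", region_name="us-east-1")

cloudformation.create_stack(
    StackName="serverless-semantic-cache",
    TemplateURL="https://<your-bucket>.s3.amazonaws.com/semantic-cache-template.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles for the Lambda function
)

# Wait until the Lambda function, layer, and OpenSearch Serverless collection are ready.
waiter = cloudformation.get_waiter("stack_create_complete")
waiter.wait(StackName="serverless-semantic-cache")
```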
Test the Solution
To test your cache using the Lambda console, open the Functions page and navigate to the function listed in the outputs of your stack. Then configure a test event that contains a sample prompt.
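You can also invoke the function from a script instead of the console. The function name and event shape below are assumptions; substitute the function name from your stack's outputs and the payload field your handler expects.

```python
import json
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Invoke the cache-manager function with a sample prompt.
response = lambda_client.invoke(
    FunctionName="<function-name-from-stack-output>",
    Payload=json.dumps({"prompt": "What is a semantic cache?"}),
)
print(json.loads(response["Payload"].read()))

# Invoking twice with semantically similar prompts should produce a cache miss
# followed by a cache hit, visible as a much lower latency on the second call.
```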
Latency Test Results
Query latency was measured with and without a cache hit, using Anthropic's Claude 2 on Amazon Bedrock as the generation model. Because a cache hit skips the LLM invocation entirely, it returns in a small fraction of the time of a full generation.
Clean up
After you are done experimenting with the Lambda function, you can quickly delete all the resources you used to build this semantic cache, including your OpenSearch Serverless collection and Lambda function.
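If you deployed the solution with the provided CloudFormation template, deleting the stack removes the resources it created. The stack name below is a placeholder matching the earlier deployment sketch.

```python
import boto3

# Delete the stack, which removes the Lambda function, layer, and OpenSearch Serverless collection.
cloudformation = boto3.client("cloudformation", region_name="us-east-1")
cloudformation.delete_stack(StackName="serverless-semantic-cache")

waiter = cloudformation.get_waiter("stack_delete_complete")
waiter.wait(StackName="serverless-semantic-cache")
```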
Conclusion
In this post, we walked you through the process of setting up a serverless read-through semantic cache. By implementing the pattern outlined here, you can reduce the latency of your LLM-based applications while simultaneously optimizing costs and improving the user experience.
Frequently Asked Questions
Q: What is a semantic cache?
A: A semantic cache stores previous prompts (and their responses) as vector embeddings and retrieves them by semantic similarity, so a new prompt that is close enough to a cached one can be answered without calling the LLM again.
Q: What is a serverless read-through semantic cache pattern?
A: It is an architecture in which a Lambda function acts as the cache manager: it searches a vector store for a semantically similar prompt, returns the cached response when the match meets the similarity threshold, and otherwise prompts the LLM for a new generation and writes the result back to the cache.
Q: What are the prerequisites for this solution?
A: You need to request access to the relevant FMs in Amazon Bedrock, including an embedding model such as Cohere Embed English or Amazon Titan Text Embeddings.
Q: How do I deploy this solution?
A: Deploy the provided CloudFormation template. It sets up a Lambda function with a pre-built layer, hosted at a public Amazon S3 prefix, containing the dependencies needed to interact with OpenSearch Serverless and Amazon Bedrock.
Q: How do I test this solution?
A: From the Lambda console, open the Functions page, choose the function listed in your stack's outputs, and configure a test event with a sample prompt.
Q: What are the latency test results?
A: Query latency was measured with and without a cache hit using Anthropic's Claude 2 on Amazon Bedrock; a cache hit avoids the LLM invocation entirely and therefore returns much faster than a full generation.

