Date:

Measuring Effectiveness of AI Guardrails in Generative AI Applications

Evaluating AI Guardrails Configuration with NeMo Guardrails

NeMo Guardrails offers a robust evaluation methodology that uses policy-based guardrails to enforce the desired behavior of your AI agent or chatbot assistant. At the core of this approach is the idea that each guardrail configuration should be designed to align with a set of well-defined policies, whether it’s preventing toxic content, ensuring on-topic responses, or delivering factually correct information.

Defining Evaluation Policies

For this example, consider the ABC bot, a simple RAG chatbot with a knowledge base composed of company information. The behavior of the chatbot is defined in the policies.yml file with the following policies:

Creating the Interactions Dataset

We’ve curated an in-house dataset of 215 interactions, with approximately 10% being multi-turn interactions. These multi-turn interactions offer insights into the dialogue dynamics by capturing extended exchanges between the user and the chatbot. Each interaction includes an expected_output attribute that specifies the desired response, for example, refusal when evaluating an input-moderation policy. To build these interaction sets, both synthetic data generation and real, expert-annotated data are effective approaches. Synthetic generation offers a straightforward method when annotated datasets are unavailable, though it requires iterative refinement and filtering, while real data ensures the highest level of relevance and accuracy.

Using the LLM as a Judge

A powerful LLM can serve as an effective judge for computing the policy compliance rate by determining if the actual responses adhere to the expected outputs. To ensure high accuracy in this automatic evaluation, we recommend establishing strong, clear rules for the LLM-as-a-judge, running the judge multiple times on the same dataset to check for inconsistencies, and validating results with a subset of manual annotations for each policy.

Evaluation Workflow

Figure 1 shows how user requests flow through various components of the evaluation tool. Starting from the user, the query is routed through to NeMo Guardrails for initial input processing. The latest AI safeguard NVIDIA NIM microservices are integrated into NeMo Guardrails to analyze the user request for content safety, topic control, and jailbreak detection.

Evaluation Results

Table 2 shows the latency per second, policy violation detection, and average throughput (tokens per second per interaction) for each guardrail configuration. Figure 2 shows a clear upward trend in evaluated policy violation detection rates across the guardrail configurations. As additional or more complex guardrails are applied, the system better adheres to the defined policies, with 75% of the policy violation detection with no guardrails to roughly 99% with the three safeguard NIM microservices integrated—a noticeable improvement of 33%.

Conclusion

NeMo Guardrails provides a robust framework for creating, managing, and evaluating AI guardrails in real-world applications. By defining clear policies, curating realistic interaction datasets, and leveraging both automated (LLM-as-a-judge) and manual evaluation methods, you gain actionable insights into policy compliance rates, resource usage, and latency impacts. The architectural flow underscores how these components interact—from user queries through guardrail checks to final policy compliance analysis, while the plots reveal the natural trade-offs between increasing policy compliance and rising latency. Ultimately, by iteratively refining guardrail configurations and balancing performance objectives, your organization can deploy AI systems that are not only accurate and safe but also responsive and cost-effective.

Latest stories

Read More

LEAVE A REPLY

Please enter your comment!
Please enter your name here