What Defines LLM Red Teaming in Practice?
LLM red teaming is an activity where people provide inputs to generative AI technologies, such as large language models (LLMs), to see if the outputs can be made to deviate from acceptable standards. This use of LLMs began in 2023 and has rapidly evolved to become a common industry practice and a cornerstone of trustworthy AI.
What Defines LLM Red Teaming in Practice?
LLM red teaming has the following defining characteristics:
- It’s limit-seeking: Red teamers find boundaries and explore limits in system behavior.
- It’s never malicious: People doing red teaming are not interested in doing harm – in fact, quite the opposite.
- It’s manual: Being a creative and playful practice, the parts of red teaming that can be automated are often most useful to give human red teamers insight for their work.
- It’s a team effort: Practitioners find inspiration in each other’s techniques and prompts, and the norm is to respect fellow practitioners’ work.
- It’s approached with an alchemist mindset: We found that red teamers tend to abandon rationalizations about models and their behavior and instead embrace the chaotic and unknown nature of the work.
Why Do People Red Team LLMs?
People who attack LLMs have a broad range of motivations. Some of these are external, such as being part of their job or a regulatory requirement. Social systems can also play a role, with people discovering LLM vulnerabilities for social media content or to participate in a closed group. Others are intrinsic, as many people do it for fun, out of curiosity, or based on concerns for model behavior.
How Do People Approach This Activity?
LLM red teaming consists of using strategies to reach goals when conversationally attacking the target. Each kind of strategy is decomposed into different techniques. A technique might just affect two or three adversarial inputs against the targets, or an input might draw upon multiple techniques.
What Can LLM Red Teaming Reveal?
The goal of LLM red teaming isn’t to quantify security. Rather, the focus is on exploration, and finding which phenomena and behaviors a red teamer can get out of the LLM. Put another way, if we get a failure just one time, then the failure is possible.
How Do People Use Knowledge That Comes from LLM Red Teaming?
Red teamers are often looking for what they describe as harms that might be presented by an LLM. There’s a broad range of definitions of harm. A red teaming exercise could focus on one of many goals or targets, which could depend on deployment context, user base, data handled, or other factors.
NVIDIA’s Definition of LLM Red Teaming
We see LLM red teaming as an instance of AI red teaming. Our definition is developed by the NVIDIA AI Red Team and takes inspiration from both this research on LLM red teaming in practice and also the definition used by the Association for Computational Linguistics’ SIG on NLP Security (SIGSEC).
Improving LLM Security and Safety
NVIDIA NeMo Guardrails is a scalable platform for defining, orchestrating, and enforcing AI guardrails for content safety, jailbreak prevention, and more in AI agents and other generative AI applications.
Acknowledgements
Thanks to Nanna Inie, Jonathan Stray, and Leon Derczynski for their work on the Summon a demon and bind it: A grounded theory of LLM red teaming paper published in PLOS One.

