Researchers from the Deciding, Acting, and Reasoning with Knowledge (DARK) Lab at University College London (UCL) leverage NVIDIA NIM microservices in their new game-based benchmark suite, Benchmarking Agentic LLM and VLM Reasoning On Games (BALROG). BALROG is designed to evaluate the agentic capabilities of models on challenging, long-horizon interactive tasks across a diverse set of game environments.

The team from the DARK Lab leveraged NVIDIA NIM to simplify their exhaustive benchmarking process. It enabled them to evaluate DeepSeek-R1, an enormous 671-billion-parameter model, as soon as the DeepSeek-R1 NIM was released at the end of February 2025. This accelerated their work, as they did not have to first deploy and host the model locally.

This post explores how NVIDIA NIM is enabling efficient benchmarking of advanced AI models using BALROG. We share insights into the benchmarking process, key results, and how NIM microservices are advancing the evaluation of agentic AI reasoning in state-of-the-art AI systems.

NVIDIA NIM for DeepSeek-R1 

NVIDIA NIM microservices are quickly redefining how researchers and developers deploy and scale AI models, offering a streamlined approach to harnessing the power of GPUs. These microservices simplify the process of running AI inference workloads by providing pre-optimized engines such as NVIDIA TensorRT and NVIDIA TensorRT-LLM, which deliver low-latency, high-throughput performance. 

What makes NIM microservices particularly exciting for researchers is their flexibility. They can be deployed across cloud platforms, data centers, or even local workstations, enabling seamless integration into diverse workflows. With support for Kubernetes-based scaling, researchers can efficiently handle workloads of any size, from small experiments to large-scale deployments. 

NIM microservices also empower users to self-host models securely and customize them for specific needs, making them a versatile solution for applications such as natural language processing, computer vision, and scientific research. Additionally, NIM can be deployed on national supercomputing centers, allowing researchers to leverage high-performance infrastructure for large-scale AI workloads and enabling secure research on private or sensitive data.

The microservices offer easy and fast API integration with standard frontends such as the OpenAI API or LangChain in Python environments; Node.js and command-line access are also available. This enables researchers to efficiently run large state-of-the-art open-source large language models (LLMs), even when resources are limited.
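As a minimal sketch, assuming an API key from build.nvidia.com, querying the DeepSeek-R1 NIM through the standard OpenAI Python client looks roughly like this (the base URL and model name follow the build.nvidia.com conventions; adjust them for a self-hosted deployment):

```python
from openai import OpenAI

# Point the standard OpenAI client at the NIM endpoint instead of api.openai.com.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="NVIDIA_API_KEY",  # replace with your actual NVIDIA API key
)

completion = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1",
    messages=[{"role": "user", "content": "Plan the next move for an agent exploring a dungeon."}],
    temperature=0.6,
    max_tokens=1024,
)

print(completion.choices[0].message.content)
```

Because the interface is OpenAI-compatible, swapping between a hosted NIM endpoint and a local deployment is largely a matter of changing `base_url`.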

With the announcement of DeepSeek-R1 in January 2025, NVIDIA provided ready-to-use NIM microservices for the various DeepSeek models. This enabled the UCL researchers to immediately evaluate the largest variant, with 671 billion parameters, through build.nvidia.com. DeepSeek-R1 was considered an ideal candidate for benchmarking with BALROG due to its advanced reasoning capabilities and its potential for tackling long-horizon challenges.

BALROG methodology

While LLMs and vision-language models (VLMs) show remarkable progress in processing information and following instructions, their ability to act effectively in complex, dynamic situations remains a key challenge. Tasks requiring sustained planning, spatial awareness, and adapting to unforeseen circumstances often push beyond their current capabilities. 

Many existing benchmarks, while useful, tend to focus on shorter interactions or static problems. This can lead to rapidly saturated results and test data leakage, and may not fully capture the essential skills needed for robust, real-world agency, such as long-term decision-making. The BALROG benchmark suite was developed to meet this growing need for a more demanding evaluation method, one that uses games to truly test AI capacity for extended reasoning and interaction.

BALROG aggregates six distinct reinforcement learning environments into a unified testbed, assessing agentic skills across varying complexities (Figure 1):

  • Crafter: A Minecraft-inspired 2D grid environment requiring exploration, resource gathering, and item crafting for survival.
  • Baba Is AI: A puzzle game where agents manipulate word blocks representing rules to alter how objects interact and solve puzzles.
  • NetHack Learning Environment (NLE): The classic roguelike known for its extreme difficulty and complexity, demanding long-term strategic planning and short-term tactics.
  • MiniHack: A multitask framework built on NLE, assessing exploration, navigation, long-term planning, and resource management through diverse tasks.
  • BabyAI: A simple, 2D grid-world testing natural language instruction following for tasks of varying complexity.
  • TextWorld: An entirely text-based game requiring exploration and natural language interaction, featuring no visual component.

Figure 1. The six game environments used in BALROG, clockwise from top left: Crafter, BabaIsAI, NetHack, MiniHack, BabyAI, and TextWorld

To ensure that models genuinely reason and adapt rather than simply rely on memorized patterns, BALROG uses procedural generation across its environments, so no two episodes are identical; the sketch below illustrates this property. The suite provides a standardized framework for rigorously assessing a large set of different models on these demanding tasks, directing development toward more capable and autonomous AI agents.
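As an illustration of what procedural generation means in practice (not BALROG's own wrapper code), the NetHack Learning Environment generates a fresh dungeon on every reset, so an agent cannot succeed by memorizing a fixed layout:

```python
import gym
import nle  # registers NetHack environments with gym; pip install nle

env = gym.make("NetHackScore-v0")
for episode in range(3):
    obs = env.reset()  # every reset procedurally generates a new dungeon
    done = False
    while not done:
        action = env.action_space.sample()  # a real agent would query an LLM here
        obs, reward, done, info = env.step(action)
env.close()
```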

BALROG results

BALROG aims to benchmark a vast range of modern language models by tracking them on its leaderboard. Agents receive observations of the environment either as natural language descriptions or in multimodal vision-language formats, and are tasked with outputting the next action in natural language. Models like DeepSeek-R1 that were trained specifically as reasoning models are allowed to reason before committing to an action. The sketch after this paragraph shows the general shape of such an evaluation loop.
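The following is a hypothetical sketch of the observe-reason-act loop; `describe`, `parse_action`, and the text-action environment wrapper are illustrative stand-ins, not BALROG's actual code:

```python
def run_episode(env, client, model="deepseek-ai/deepseek-r1", max_steps=100):
    """Roll out one episode, querying the model for an action at each step."""
    obs = env.reset()
    history = []
    for _ in range(max_steps):
        prompt = describe(obs, history)  # hypothetical: render the observation as text
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        action = parse_action(reply)  # hypothetical: discard reasoning, keep the final action
        obs, reward, done, info = env.step(action)
        history.append(action)
        if done:
            break
    return info
```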

BALROG uses a standardized metric scoring performance on each task from 0 to 100. For environments with discrete goals (BabyAI, Baba Is AI, MiniHack), scores are binary (0 for failure, 100 for success). For environments with more granular progress (TextWorld, Crafter, NetHack), the score represents the proportion of objectives achieved or milestones reached. 
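Stated as a rough formula rather than BALROG's exact implementation, the normalization looks like this:

```python
def balrog_score(env: str, success: bool = False,
                 milestones_reached: int = 0, milestones_total: int = 1) -> float:
    """Normalize task performance to a 0-100 score (illustrative only)."""
    if env in {"babyai", "babaisai", "minihack"}:
        # Discrete-goal environments: all-or-nothing.
        return 100.0 if success else 0.0
    # Granular environments (textworld, crafter, nethack): fraction of milestones reached.
    return 100.0 * milestones_reached / milestones_total
```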

The researchers behind BALROG evaluated the DeepSeek-R1 NIM through the OpenAI API, which made switching to the new model seamless. Their evaluation shows that DeepSeek-R1 achieves a new state-of-the-art performance on BALROG, with an average progression of 34.9% ± 2.1%, edging out the previous leader, Claude 3.5 Sonnet (32.6% ± 1.9%). This placed the model at the top of the leaderboard at the time of writing. Because the NIM integrates with standard APIs, the team could query DeepSeek-R1 effortlessly, a feat that would otherwise be close to impossible for most academic researchers given the sheer size of the full model.

Further analysis of progression against API cost, compared across various other models, shows that DeepSeek-R1 served with NVIDIA NIM offers very high performance at a lower cost (Figure 2).

Figure 2. BALROG score versus average API cost per episode for various popular LLMs

Conclusion

NVIDIA NIM has lowered the effort required to access and use modern LLMs and VLMs. The wide range of available APIs makes it easy to integrate them into existing environments, like BALROG. What's more, NIM microservices can be used remotely in the cloud right away or deployed locally if compute resources are available. Thanks to cloud-based access, researchers from the DARK Lab did not need to deploy the model locally; instead, they could instantly use one of the latest and largest state-of-the-art models upon its release.

To learn more about the BALROG methodology, see the ICLR 2025 paper, BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games. The researchers also plan to benchmark NVIDIA Llama Nemotron Ultra and Llama 4 models that are available as NIM microservices.

To get started with NVIDIA NIM to deploy, evaluate, and scale state-of-the-art AI models using industry-standard APIs, visit NVIDIA NIM for Developers.
