Curating Biological Findings with NVIDIA NIM

Scientific Papers are Highly Heterogeneous

Scientific papers are highly heterogeneous, often employing diverse terminologies for the same entities, using varied methodologies to study biological phenomena, and presenting findings within distinct contexts. Extracting meaningful insights from these papers requires a profound understanding of biology, a critical evaluation of methodologies, and the ability to discern robust findings from irrelevant or less reliable ones.

Scientists Must Carefully Interpret the Context

Scientists must carefully interpret the context, assess the reliability of experimental evidence, and identify potential biases or limitations in studies. Because disease models support critical decision-making, they demand high precision, so only high-quality, well-supported biological findings should be incorporated into them.

Large Language Models (LLMs) Can Automate Curation

Large language models (LLMs), when integrated into a retrieval-augmented generation (RAG) pipeline, can automate and speed up the curation of biological findings. By extracting insights from scientific papers automatically, they make the process far more scalable: they can sift through many more papers than any individual could review manually and surface a significantly larger volume of relevant findings.

CytoReason Develops Computational Disease Models

The team at CytoReason, a member of the NVIDIA Inception program, develops computational disease models, harnessing AI to mine vast amounts of molecular and textual data to support biopharma’s decision-making. By capturing mechanisms of action (MOAs), gene regulation, patient responses, and more, these models can simulate human diseases at the tissue, cellular, and gene levels.

RAG Pipeline Powered by NVIDIA NIM

The CytoReason team developed a RAG pipeline powered by NVIDIA NIM microservices to scale up the mining of the biological findings that are integrated into CytoReason’s computational disease models. Figure 1 illustrates the flow.

Figure 1. CytoReason’s RAG pipeline for extracting biological expectations

Results

The team evaluated the RAG pipeline using a benchmark focused on gene expression in Crohn’s disease in the ileum. For this benchmark, manual curation by an immunologist took days and identified a total of 101 genes as differentially expressed (either upregulated or downregulated) between healthy and inflamed conditions.

The RAG pipeline extracted information about 99 genes in a matter of minutes, 70 of which overlapped with those identified through manual curation. The remaining 29 genes were new discoveries and were subsequently validated for accuracy by an expert. The evidence produced by the pipeline for all genes was accurate in 96% of the cases.
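The reported counts can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section. The variable names are illustrative, and the values marked "derived" are computed here rather than reported by the team.

```python
# Figures reported for the benchmark (Crohn's disease, ileum).
manual_genes = 101    # genes identified by manual curation
pipeline_genes = 99   # genes extracted by the RAG pipeline
overlap = 70          # genes identified by both

# Genes found only by the pipeline (reported as 29 new discoveries).
pipeline_only = pipeline_genes - overlap

# Genes in the manual set that the pipeline did not return (derived).
manual_only = manual_genes - overlap

# Share of the manually curated set that the pipeline recovered (derived).
manual_recovered = overlap / manual_genes

# Size of the gene set after merging both sources (derived).
combined = manual_genes + pipeline_only

print(f"pipeline-only genes: {pipeline_only}")
print(f"manual-only genes: {manual_only}")
print(f"share of manual set recovered: {manual_recovered:.1%}")
print(f"combined gene set: {combined}")
```

Read this way, the pipeline is best seen as complementing manual curation: the two sources overlap substantially, and each contributes genes the other did not list.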

Summary

Mining biological insights from literature is a complex task that traditionally takes days and requires deep expertise in biology. By leveraging NVIDIA NIM and LLM technology, CytoReason has significantly reduced the time required for this process, from days to just a few hours. The benchmark results show that the extracted insights are highly precise and that the pipeline extends coverage to biological entities that manual curation did not identify.

Acknowledgments

We would like to thank NVIDIA for their professional, patient, and welcoming support throughout this project. We are also grateful to our colleagues at CytoReason who contributed their time and expertise.

Conclusion

The RAG pipeline powered by NVIDIA NIM and LLM technology has successfully demonstrated its ability to extract biological insights from literature, reducing the time required for this process and improving the scalability and accuracy of the results.

Frequently Asked Questions (FAQs)

What is the RAG Pipeline?

The RAG pipeline is a retrieval-augmented generation pipeline that integrates large language models (LLMs) with retrieval technology to extract biological insights from literature.

How Does the RAG Pipeline Work?

The RAG pipeline begins with a structured input, which is defined by four key parameters: entity type, disease, tissue, and conditions. The pipeline then retrieves relevant papers from scientific repositories, applies biological guardrails to refine the collection, and extracts evidence about the entities of interest using an NVIDIA LLM NIM.
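The stages described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not CytoReason's implementation: the class and function names are hypothetical, retrieval runs over an in-memory list instead of scientific repositories, the guardrail is a simple keyword check (the source does not specify the real guardrail logic), and the LLM is a stub callable standing in for a call to an NVIDIA LLM NIM.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CurationQuery:
    """Structured input: the four parameters named in the text."""
    entity_type: str
    disease: str
    tissue: str
    conditions: tuple[str, ...]


@dataclass(frozen=True)
class Paper:
    """Hypothetical minimal paper record."""
    paper_id: str
    abstract: str


def retrieve(query: CurationQuery, corpus: Iterable[Paper]) -> list[Paper]:
    """Stand-in retrieval: keep papers that mention the disease."""
    return [p for p in corpus if query.disease.lower() in p.abstract.lower()]


def apply_guardrails(query: CurationQuery, papers: list[Paper]) -> list[Paper]:
    """Assumed guardrail: the tissue and every condition must be mentioned."""
    required = [query.tissue, *query.conditions]
    return [
        p for p in papers
        if all(term.lower() in p.abstract.lower() for term in required)
    ]


def extract_evidence(
    query: CurationQuery, papers: list[Paper], llm: Callable[[str], str]
) -> dict[str, str]:
    """Ask the LLM for evidence about the entities of interest, per paper."""
    evidence = {}
    for paper in papers:
        prompt = (
            f"List each {query.entity_type} reported as changed in {query.disease} "
            f"({query.tissue}; {' vs '.join(query.conditions)}), "
            f"quoting the supporting text.\n\n{paper.abstract}"
        )
        evidence[paper.paper_id] = llm(prompt)
    return evidence


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns no biological content."""
    return f"stub answer (prompt had {len(prompt)} characters)"


# Placeholder corpus; the abstracts contain no real findings.
corpus = [
    Paper("p1", "Placeholder: Crohn's disease, ileum, healthy vs inflamed samples."),
    Paper("p2", "Placeholder: Crohn's disease, colon, healthy vs inflamed samples."),
    Paper("p3", "Placeholder: ulcerative colitis, ileum, healthy vs inflamed samples."),
]
query = CurationQuery("gene", "Crohn's disease", "ileum", ("healthy", "inflamed"))

candidates = retrieve(query, corpus)            # p1 and p2 mention the disease
filtered = apply_guardrails(query, candidates)  # only p1 mentions the ileum
evidence = extract_evidence(query, filtered, stub_llm)
print(evidence)
```

Passing the model in as a callable keeps the stages testable in isolation; swapping `stub_llm` for a function that calls a deployed model endpoint would not change the rest of the sketch.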

What Are the Benefits of the RAG Pipeline?

The RAG pipeline improves the scalability, accuracy, and speed of extracting biological insights from literature, reduces the time required for curation, and increases the coverage of biological entities.

Can I Use the RAG Pipeline for My Own Research?

Yes, you can use the RAG pipeline for your own research. To get started, visit NVIDIA NIM for Developers.
