Curating Biological Findings with NVIDIA NIM

Scientific Papers are Highly Heterogeneous

Scientific papers are highly heterogeneous, often employing diverse terminologies for the same entities, using varied methodologies to study biological phenomena, and presenting findings within distinct contexts. Extracting meaningful insights from these papers requires a profound understanding of biology, a critical evaluation of methodologies, and the ability to discern robust findings from irrelevant or less reliable ones.

Scientists Must Carefully Interpret the Context

Scientists must carefully interpret the context, assess the reliability of experimental evidence, and identify potential biases or limitations in studies. Because disease models support critical decision-making and therefore demand high precision, the biological findings they incorporate must reflect only high-quality knowledge.

Large Language Models (LLMs) Can Automate Curation

Large language models (LLMs), when integrated into a retrieval-augmented generation (RAG) pipeline, present a game-changing opportunity to automate and expedite the curation of biological findings. By optimizing the extraction of insights from scientific papers, LLMs dramatically improve the scalability of this process. These language models can sift through far more papers than any individual could manually review and uncover a significantly larger volume of relevant findings.

CytoReason Develops Computational Disease Models

The team at CytoReason, a member of the NVIDIA Inception program, develops computational disease models, harnessing AI to mine vast amounts of molecular and textual data to support biopharma’s decision-making. By capturing mechanisms of action (MOAs), gene regulation, patient responses, and more, these models can simulate human diseases at the tissue, cellular, and gene levels.

RAG Pipeline Powered by NVIDIA NIM

The CytoReason team developed a RAG pipeline powered by NVIDIA NIM microservices to scale up the mining of biological findings that are integrated into CytoReason’s computational disease models. Figure 1 illustrates the flow.

Figure 1. CytoReason’s RAG pipeline for extracting biological expectations

Results

The team evaluated the RAG pipeline using a benchmark focused on gene expression in Crohn’s disease in the ileum. In this case, manual curation by an immunologist took days and identified a total of 101 genes as differentially expressed (either upregulated or downregulated) between healthy and inflamed conditions.

The RAG pipeline extracted information about 99 genes in a matter of minutes, 70 of which overlapped with those identified through manual curation. The remaining 29 genes were new discoveries and were subsequently validated for accuracy by an expert. The evidence produced by the pipeline for all genes was accurate in 96% of the cases.
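The relationship between these counts can be checked with a few lines of arithmetic. The sketch below uses only the three numbers reported above (101, 99, and 70). The variable names are illustrative, and the "missed" and "combined" figures are derived here on the assumption that the 70 overlapping genes are an exact set intersection; they are not reported in the article.

```python
# Counts taken from the benchmark described above.
manual_genes = 101    # genes identified by manual curation
pipeline_genes = 99   # genes extracted by the RAG pipeline
overlap = 70          # genes found by both

# Derived quantities (assuming the overlap is an exact set intersection).
novel = pipeline_genes - overlap           # found only by the pipeline
missed = manual_genes - overlap            # found only by manual curation
combined = manual_genes + novel            # size of the union of both sets
recovered_share = overlap / manual_genes   # share of manual genes the pipeline also found

print(f"novel genes from pipeline: {novel}")               # 29
print(f"manual genes not extracted: {missed}")             # 31
print(f"combined gene set size: {combined}")               # 130
print(f"share of manual set recovered: {recovered_share:.1%}")  # 69.3%
```

Read this way, the pipeline recovers roughly two thirds of the manually curated set while adding genes the manual pass did not list, which suggests the two approaches complement each other.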

Summary

Mining biological insights from literature is a complex task that traditionally takes days and requires deep expertise in biology. By leveraging NVIDIA NIM and LLM technology, CytoReason has significantly reduced the time required for this process, from days to just a few hours. The benchmark results show high precision, with evidence that was accurate in 96% of cases, and coverage that extends beyond manual curation, with 29 genes that the human curator had not identified.

Acknowledgments

We would like to thank NVIDIA for their professional, patient, and welcoming support throughout this project. We are also grateful to our colleagues at CytoReason who contributed their time and expertise.

Conclusion

The RAG pipeline powered by NVIDIA NIM and LLM technology has successfully demonstrated its ability to extract biological insights from literature, reducing the time required for this process and improving the scalability and accuracy of the results.

Frequently Asked Questions (FAQs)

What is the RAG Pipeline?

The RAG pipeline is a retrieval-augmented generation pipeline that integrates large language models (LLMs) with retrieval technology to extract biological insights from literature.

How Does the RAG Pipeline Work?

The RAG pipeline begins with a structured input, which is defined by four key parameters: entity type, disease, tissue, and conditions. The pipeline then retrieves relevant papers from scientific repositories, applies biological guardrails to refine the collection, and extracts evidence about the entities of interest using an NVIDIA LLM NIM.
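The steps above can be sketched in code. This is a minimal illustration under stated assumptions, not CytoReason's implementation: the class and function names, the keyword-based guardrail, and the prompt wording are all hypothetical, and retrieval and extraction are replaced with toy stand-ins so the example runs offline. In a real deployment, the `extract` callable would send the prompt to an LLM NIM endpoint (NIM LLM microservices expose an OpenAI-compatible API).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CurationQuery:
    """The four input parameters named in the text."""
    entity_type: str
    disease: str
    tissue: str
    conditions: Tuple[str, ...]


@dataclass(frozen=True)
class Paper:
    paper_id: str
    text: str


def passes_guardrail(paper: Paper, query: CurationQuery) -> bool:
    # Hypothetical guardrail: keep only papers that mention both the disease
    # and the tissue. The actual biological guardrails are not described.
    text = paper.text.lower()
    return query.disease.lower() in text and query.tissue.lower() in text


def build_prompt(paper: Paper, query: CurationQuery) -> str:
    # Hypothetical prompt wording for the evidence-extraction step.
    return (
        f"List each {query.entity_type} reported as differentially expressed "
        f"in {query.tissue} for {query.disease}, comparing "
        f"{' vs. '.join(query.conditions)}. Quote the supporting sentence.\n\n"
        f"Paper text:\n{paper.text}"
    )


def run_pipeline(
    query: CurationQuery,
    retrieve: Callable[[CurationQuery], List[Paper]],
    extract: Callable[[str], str],
) -> Dict[str, str]:
    papers = retrieve(query)                                   # 1. retrieve papers
    kept = [p for p in papers if passes_guardrail(p, query)]   # 2. apply guardrails
    return {p.paper_id: extract(build_prompt(p, query)) for p in kept}  # 3. extract


# Toy stand-ins with made-up text, used only so the sketch runs offline.
def fake_retrieve(query: CurationQuery) -> List[Paper]:
    return [
        Paper("paper-1", "GENE_A is upregulated in inflamed ileum in Crohn's disease."),
        Paper("paper-2", "A study of colon tissue in ulcerative colitis."),
    ]


def fake_extract(prompt: str) -> str:
    return "GENE_A: upregulated" if "GENE_A" in prompt else "no evidence"


query = CurationQuery("gene", "Crohn's disease", "ileum", ("healthy", "inflamed"))
results = run_pipeline(query, fake_retrieve, fake_extract)
print(results)  # {'paper-1': 'GENE_A: upregulated'}
```

Passing `retrieve` and `extract` in as callables keeps the three stages separable, so each can be swapped or tested on its own.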

What Are the Benefits of the RAG Pipeline?

The benefits of the RAG pipeline include improved scalability, accuracy, and speed in extracting biological insights from literature, less time spent on manual curation, and broader coverage of biological entities.

Can I Use the RAG Pipeline for My Own Research?

Yes, you can use the RAG pipeline for your own research. To get started, visit NVIDIA NIM for Developers.
