Continued Pretraining of State-of-the-Art LLMs for Sovereign AI and Regulated Industries with iGenius and NVIDIA DGX Cloud

Challenges and Best Practices for Building Large Language Models

In recent years, large language models (LLMs) have achieved extraordinary progress in areas such as reasoning, code generation, machine translation, and summarization. However, despite their advanced capabilities, foundation models have limitations when it comes to domain-specific expertise, such as finance or healthcare, and to capturing cultural and linguistic nuances beyond English.

Overcoming Limitations

These limitations can be addressed with further development using continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG). Doing so requires high-quality, domain-specific datasets, a robust AI platform (software and hardware stack), and advanced AI expertise.
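To make the first two stages concrete, here is a minimal sketch contrasting the kind of record used for CPT with one used for instruction fine-tuning. The text and prompts below are illustrative placeholders, not iGenius's actual data.

    # Illustrative training records: continued pretraining vs. instruction fine-tuning.
    # The domain text and prompts are made-up placeholders.
    cpt_record = {
        # CPT continues next-token prediction on raw in-domain text
        # (for example, Italian-language finance and legal documents).
        "text": "Ai sensi dell'articolo 2423 del Codice Civile, il bilancio ...",
    }

    sft_record = {
        # Instruction fine-tuning then aligns the model to follow prompts in that domain.
        "prompt": "Summarize the liquidity risks described in the quarterly report.",
        "response": "The report highlights three liquidity risks: ...",
    }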

iGenius and Colosseum 355B

iGenius, an Italian technology company specializing in artificial intelligence for enterprises operating in highly regulated sectors, such as financial services and public administration, aimed to develop a state-of-the-art foundational LLM within a tight timeline. They faced challenges in accessing large-scale GPU clusters (thousands of GPUs) and securing support for highly scalable training frameworks.

Collaboration with NVIDIA

iGenius chose to collaborate with NVIDIA to accelerate LLM development. During this engagement, iGenius developed Colosseum 355B, an LLM designed and developed for highly regulated environments that gives businesses confidence in the accuracy of model output and in security, knowing that none of their information or IP is ever compromised.

DGX Cloud Environment

Improving LLM reasoning capabilities requires a robust, distributed hardware and software solution in which accelerated compute, networking, storage, and libraries work together seamlessly. Any bottleneck in the system can significantly slow or even stop the entire training process. NVIDIA DGX SuperPOD removes this risk and complexity by providing a fully optimized solution that is designed, built, and validated by NVIDIA before a ready-to-go system is handed over to the customer.

CPT and Alignment

iGenius used large-scale CPT and alignment on specific domains to build Colosseum 355B, a foundational LLM developed using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework. They reduced computational costs and improved efficiency through CPT in FP8 precision.
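As an illustration of what FP8 training involves at the code level, the following is a minimal sketch using NVIDIA Transformer Engine in PyTorch. The layer size, batch size, and recipe settings are placeholder assumptions and do not reflect the actual Colosseum 355B configuration.

    # Minimal FP8 training sketch with NVIDIA Transformer Engine (PyTorch).
    # Placeholder model and recipe values; not the Colosseum 355B setup.
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # Hybrid FP8 recipe: E4M3 for forward tensors, E5M2 for gradients.
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

    model = te.Linear(4096, 4096, bias=True).cuda()   # stand-in for a transformer layer
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = model(x)                                  # GEMMs execute in FP8
        loss = y.float().pow(2).mean()
    loss.backward()
    optimizer.step()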

Challenges and Best Practices

As training scales, minor issues become critical. Simply loading the checkpoint for the Colosseum 355B training job on roughly 3,000 GPUs can take 15–20 minutes, with a checkpoint footprint of about 5 TB. A storage system that performs well with many small jobs spread across thousands of GPUs may struggle when a single workload requires all GPUs to read and write checkpoints simultaneously, causing delays and potential timeouts.
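A rough back-of-the-envelope estimate shows why checkpoint I/O dominates at this scale; the effective bandwidth figure below is an illustrative assumption, not a measured value.

    # Back-of-the-envelope checkpoint-load estimate (illustrative numbers only).
    checkpoint_bytes = 5 * 1024**4      # ~5 TB checkpoint footprint
    effective_read_gbps = 5             # assumed effective read bandwidth under contention, GB/s

    seconds = checkpoint_bytes / (effective_read_gbps * 1024**3)
    print(f"Estimated load time: {seconds / 60:.1f} minutes")   # ~17 minutes

    # Metadata overhead, deserialization, and thousands of ranks contending for
    # the same files are what push the effective bandwidth this low in practice.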

Best Practices

Here are some best practices and lessons learned when running LLM training at scale:

  • Explore the basics at a reduced scale
      ◦ Progressive scaling is key
      ◦ Small debug model: exploring large-scale training configurations is highly challenging, so start from a small debug model
      ◦ End-to-end process testing
      ◦ Robust checkpointing (see the sketch after this list)
      ◦ Expansion from minimal to large-scale distribution
      ◦ Dataset testing
  • Monitor effectively and track at scale
      ◦ Accurate experiment tracking
      ◦ Infrastructure observability
      ◦ Predefined tests
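As a concrete illustration of the robust checkpointing item above, the following sketch writes a checkpoint to a temporary file and atomically swaps it into place only after the write completes, so a crash mid-write never corrupts the last good checkpoint. This is a generic PyTorch pattern, not the distributed checkpointing implementation used in NeMo; the paths and state dictionary contents are placeholders.

    # Atomic checkpoint save: write to a temp file, then rename into place.
    import os
    import torch

    def save_checkpoint_atomic(state: dict, path: str) -> None:
        tmp_path = path + ".tmp"
        torch.save(state, tmp_path)     # the full write goes to the temp file
        os.replace(tmp_path, path)      # atomic rename; the old checkpoint stays valid until now

    def load_latest(path: str) -> dict:
        return torch.load(path, map_location="cpu")

    # Usage: save_checkpoint_atomic({"step": step, "model": model.state_dict()}, "ckpt.pt")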

Conclusion

By using large-scale CPT and alignment on specific domains, iGenius built Colosseum 355B, a foundational LLM developed using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework. iGenius reduced computational costs and improved efficiency through CPT in FP8 precision.

FAQs

Q: What are the limitations of foundation models?
A: Foundation models have limitations when it comes to domain-specific expertise, such as finance or healthcare, and to capturing cultural and linguistic nuances beyond English.

Q: How can these limitations be overcome?
A: These limitations can be overcome with further development using continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG).

Q: What is Colosseum 355B?
A: Colosseum 355B is a foundational LLM developed by iGenius using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework.

Q: What are the best practices for running LLM training at scale?
A: The best practices include exploring the basics at a reduced scale (progressive scaling, a small debug model, end-to-end process testing, robust checkpointing, dataset testing) and monitoring effectively and tracking at scale (accurate experiment tracking, infrastructure observability, predefined tests).
