Continued Pretraining of State-of-the-Art LLMs for Sovereign AI and Regulated Industries with iGenius and NVIDIA DGX Cloud

Challenges and Best Practices for Building Large Language Models

In recent years, large language models (LLMs) have achieved extraordinary progress in areas such as reasoning, code generation, machine translation, and summarization. However, despite their advanced capabilities, foundation models have limitations when it comes to domain-specific expertise, such as finance or healthcare, and to capturing cultural and linguistic nuances beyond English.

Overcoming Limitations

These limitations can be addressed with further development using continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG). Doing so requires high-quality, domain-specific datasets, a robust AI platform (software and hardware stack), and advanced AI expertise.
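To make the first two stages concrete, here is a minimal sketch contrasting the kind of record used for CPT with one used for instruction fine-tuning. The text and prompts below are illustrative placeholders, not iGenius's actual data.

    # Illustrative training records: continued pretraining vs. instruction fine-tuning.
    # The domain text and prompts are made-up placeholders.
    cpt_record = {
        # CPT continues next-token prediction on raw in-domain text
        # (for example, Italian-language finance and legal documents).
        "text": "Ai sensi dell'articolo 2423 del Codice Civile, il bilancio ...",
    }

    sft_record = {
        # Instruction fine-tuning then aligns the model to follow prompts in that domain.
        "prompt": "Summarize the liquidity risks described in the quarterly report.",
        "response": "The report highlights three liquidity risks: ...",
    }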

iGenius and Colosseum 355B

iGenius, an Italian technology company specializing in artificial intelligence for enterprises operating in highly regulated sectors, such as financial services and public administration, aimed to develop a state-of-the-art foundational LLM within a tight timeline. They faced challenges in accessing large-scale GPU clusters (thousands of GPUs) and securing support for highly scalable training frameworks.

Collaboration with NVIDIA

iGenius chose to collaborate with NVIDIA to accelerate LLM development. During this engagement, iGenius developed Colosseum 355B, an LLM designed and developed for highly regulated environments that gives businesses confidence in the accuracy of model output and in security, knowing that none of their information or IP is ever compromised.

DGX Cloud Environment

Improving LLM reasoning capabilities requires a robust, distributed hardware and software solution in which accelerated compute, networking, storage, and libraries work together seamlessly. Any bottleneck in the system can significantly slow or even stop the entire training process. NVIDIA DGX SuperPOD removes this risk and complexity by providing a fully optimized solution that is designed, built, and validated by NVIDIA before a ready-to-go system is handed over to the customer.

CPT and Alignment

iGenius used large-scale CPT and alignment on specific domains to build Colosseum 355B, a foundational LLM developed using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework. They reduced computational costs and improved efficiency through CPT in FP8 precision.
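As an illustration of what FP8 training involves at the code level, the following is a minimal sketch using NVIDIA Transformer Engine in PyTorch. The layer size, batch size, and recipe settings are placeholder assumptions and do not reflect the actual Colosseum 355B configuration.

    # Minimal FP8 training sketch with NVIDIA Transformer Engine (PyTorch).
    # Placeholder model and recipe values; not the Colosseum 355B setup.
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # Hybrid FP8 recipe: E4M3 for forward tensors, E5M2 for gradients.
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

    model = te.Linear(4096, 4096, bias=True).cuda()   # stand-in for a transformer layer
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = model(x)                                  # GEMMs execute in FP8
        loss = y.float().pow(2).mean()
    loss.backward()
    optimizer.step()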

Challenges and Best Practices

As training scales, minor issues become critical. Simply loading the checkpoint for the Colosseum 355B training job on roughly 3,000 GPUs can take 15–20 minutes, with a checkpoint footprint of about 5 TB. A storage system that performs well with many small jobs spread across thousands of GPUs may struggle when a single workload requires all GPUs to read and write checkpoints simultaneously, causing delays and potential timeouts.
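A rough back-of-the-envelope estimate shows why checkpoint I/O dominates at this scale; the effective bandwidth figure below is an illustrative assumption, not a measured value.

    # Back-of-the-envelope checkpoint-load estimate (illustrative numbers only).
    checkpoint_bytes = 5 * 1024**4      # ~5 TB checkpoint footprint
    effective_read_gbps = 5             # assumed effective read bandwidth under contention, GB/s

    seconds = checkpoint_bytes / (effective_read_gbps * 1024**3)
    print(f"Estimated load time: {seconds / 60:.1f} minutes")   # ~17 minutes

    # Metadata overhead, deserialization, and thousands of ranks contending for
    # the same files are what push the effective bandwidth this low in practice.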

Best Practices

Here are some best practices and lessons learned when running LLM training at scale:

  • Explore the basics at a reduced scale
      ◦ Progressive scaling is key
      ◦ Small debug model: exploring large-scale training configurations is highly challenging, so start from a small debug model
      ◦ End-to-end process testing
      ◦ Robust checkpointing (see the sketch after this list)
      ◦ Expansion from minimal to large-scale distribution
      ◦ Dataset testing
  • Monitor effectively and track at scale
      ◦ Accurate experiment tracking
      ◦ Infrastructure observability
      ◦ Predefined tests
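As a concrete illustration of the robust checkpointing item above, the following sketch writes a checkpoint to a temporary file and atomically swaps it into place only after the write completes, so a crash mid-write never corrupts the last good checkpoint. This is a generic PyTorch pattern, not the distributed checkpointing implementation used in NeMo; the paths and state dictionary contents are placeholders.

    # Atomic checkpoint save: write to a temp file, then rename into place.
    import os
    import torch

    def save_checkpoint_atomic(state: dict, path: str) -> None:
        tmp_path = path + ".tmp"
        torch.save(state, tmp_path)     # the full write goes to the temp file
        os.replace(tmp_path, path)      # atomic rename; the old checkpoint stays valid until now

    def load_latest(path: str) -> dict:
        return torch.load(path, map_location="cpu")

    # Usage: save_checkpoint_atomic({"step": step, "model": model.state_dict()}, "ckpt.pt")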

Conclusion

By using large-scale CPT and alignment on specific domains, iGenius built Colosseum 355B, a foundational LLM developed using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework. iGenius reduced computational costs and improved efficiency through CPT in FP8 precision.

FAQs

Q: What are the limitations of foundation models?
A: Foundation models have limitations when it comes to domain-specific expertise, such as finance or healthcare, and to capturing cultural and linguistic nuances beyond English.

Q: How can these limitations be overcome?
A: These limitations can be overcome with further development using continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG).

Q: What is Colosseum 355B?
A: Colosseum 355B is a foundational LLM developed by iGenius using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework.

Q: What are the best practices for running LLM training at scale?
A: The best practices include exploring the basics at a reduced scale (progressive scaling, a small debug model, end-to-end process testing, robust checkpointing, dataset testing) and monitoring effectively and tracking at scale (accurate experiment tracking, infrastructure observability, predefined tests).
