Reliable Model Training on NVIDIA DGX Cloud

Minimizing Downtime

As a model builder, when you encounter an error during training, the key challenge is identifying the cause, locating the faulty component, and finding a way to keep the job moving forward with minimal delay. That delay is exacerbated in environments where engineer intervention is required for recovery, which often adds hours to triage and remediation.

Reducing downtime requires both reactive and proactive systems throughout training. At scale, errors are inevitable, and the speed of detection and recovery is critical. For both application and hardware failures, error attribution is key.

Error Attribution

For error attribution, we broadly categorize the kinds of errors that researchers encounter into the following main buckets:

  • Immediate crashes: Stem from hardware faults such as BIOS, power-supply, or thermal issues, uncorrectable ECC errors, silent data corruption (NaNs in intermediate results), or network instability (link flapping).
  • Hangs in communication libraries: Often manifest as PyTorch NCCL watchdog errors and Transformer Engine communication hangs. Hangs are often due to cascading dependencies in data transfer from the filesystem (e.g., for input data) and from tensors (e.g., gradients, intermediate activations, and so on) across the east-west (E/W) network. This highlights the need for robust fault tolerance, containment, and early detection mechanisms within libraries and applications.
  • Speed regressions: These encompass both transient slowdowns (e.g., temporary network or storage issues) and persistent bottlenecks (e.g., a consistently slow GPU in a large cluster). These regressions can significantly affect overall training speed and efficiency.
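A first pass at attribution can often be automated by bucketing log lines against known signatures of these three failure classes. The sketch below is a minimal, hypothetical classifier; the patterns are illustrative examples of common NVIDIA driver and NCCL messages, not an exhaustive or site-specific rule set.

```python
import re

# Hypothetical signatures for each failure bucket; a real deployment
# would curate these from its own historical log repository.
ERROR_BUCKETS = {
    "immediate_crash": [
        r"uncorrectable ECC error",
        r"NVRM: Xid",                 # NVIDIA driver Xid events
        r"NaN detected",
        r"link (flap|down)",
    ],
    "communication_hang": [
        r"NCCL watchdog",
        r"collective operation timeout",
    ],
    "speed_regression": [
        r"iteration time .* exceeds",
        r"throughput dropped",
    ],
}

def classify(log_line: str) -> str:
    """Return the first bucket whose signature matches, else 'unknown'."""
    for bucket, patterns in ERROR_BUCKETS.items():
        if any(re.search(p, log_line, re.IGNORECASE) for p in patterns):
            return bucket
    return "unknown"
```

Keeping the patterns in a plain dictionary makes it easy to grow the rule set as new failure modes are observed, without changing the classification logic.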

Cluster Telemetry

Cluster-level telemetry covers storage servers (including metadata and read/write operations) as well as network switches. This visibility is crucial because a failure in one node can often spread to other nodes through communication calls, passing corrupted gradients, or overloading the storage system.

Node Telemetry

Periodic health checks at the node level ensure that key hardware and software components such as GPUs, CPUs, memory, network, storage, and services are functioning correctly. Preliminary checks before a job starts also validate hardware status, verify software dependencies, and configure the environment for the task.
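As one illustration, a pre-job check might query each GPU's temperature and volatile uncorrectable ECC count via `nvidia-smi` and flag GPUs that exceed simple thresholds. This is a minimal sketch, not a production health-check suite; the 85 °C threshold is an assumed example value.

```python
import subprocess

def parse_gpu_health(csv_output: str, max_temp_c: int = 85) -> list:
    """Parse nvidia-smi CSV output (temperature, uncorrected ECC count)
    and return the indices of GPUs failing a simple threshold check."""
    unhealthy = []
    for idx, line in enumerate(csv_output.strip().splitlines()):
        temp_str, ecc_str = [f.strip() for f in line.split(",")]
        ecc = 0 if ecc_str.strip("[]") == "N/A" else int(ecc_str)
        if int(temp_str) > max_temp_c or ecc > 0:
            unhealthy.append(idx)
    return unhealthy

def check_node() -> list:
    """Run nvidia-smi on this node; return unhealthy GPU indices."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,"
         "ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_health(out)
```

Separating the parser from the `subprocess` call keeps the threshold logic testable without GPUs present.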

Application Logs

Applications have critical knowledge of the key control points, invariants, and measures of progress, including system errors and performance patterns. They provide one of the strongest signals for error attribution, especially when correlated with historical data in a central repository to spot recurring failures over time.
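One way an application can expose its measure of progress is to emit a structured record per training step and flag steps that are slow relative to a rolling baseline, catching the speed regressions described earlier. The function below is a hedged sketch; the window size and 2x regression factor are assumed example values, and `print` stands in for shipping the record to a central log repository.

```python
import json
import statistics

def log_step(step: int, step_time: float, history: list,
             window: int = 50, factor: float = 2.0) -> dict:
    """Record one training step and flag it if its duration exceeds
    `factor` times the rolling median of recent steps."""
    history.append(step_time)
    baseline = statistics.median(history[-window:])
    record = {
        "step": step,
        "step_time_s": round(step_time, 4),
        "baseline_s": round(baseline, 4),
        "regression": step_time > factor * baseline,
    }
    print(json.dumps(record))  # stand-in for a central log pipeline
    return record
```

Using the median rather than the mean keeps the baseline robust to the occasional outlier step (e.g., a checkpoint write).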

Unified Telemetry

Analyzing telemetry over time, across both intra-job (within a single job) and inter-job (across multiple jobs) contexts, helps identify recurring issues, detect patterns, and take proactive measures rather than purely reactive ones.
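A simple inter-job analysis is to count, for each node, how many distinct jobs it has caused to fail; nodes above a threshold become candidates for proactive removal from the scheduling pool before they hurt the next job. This is a minimal sketch over a hypothetical failure-record format of `(job_id, node_id)` pairs.

```python
def recurring_failures(job_failures, threshold: int = 3) -> list:
    """Given (job_id, node_id) failure records, return nodes that
    failed at least `threshold` distinct jobs, sorted by name."""
    jobs_per_node: dict = {}
    for job_id, node in job_failures:
        jobs_per_node.setdefault(node, set()).add(job_id)
    return sorted(node for node, jobs in jobs_per_node.items()
                  if len(jobs) >= threshold)
```

Counting distinct jobs rather than raw failure events avoids over-weighting a single job that restarted many times on the same bad node.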

Conclusion

We’ve found that end-to-end resilience requires a holistic view. High uptime depends on a comprehensive approach that spans both infrastructure and developer experience.

FAQs

Q: What is the key challenge in minimizing downtime?
A: Identifying the cause, locating the issue, and finding a way to keep the job moving forward to avoid delays.

Q: What is error attribution, and why is it important?
A: Error attribution is the process of identifying the root cause of an error. It is important because it helps in developing solutions and processes that enable researchers to maintain momentum and keep workflows moving forward.

Q: What are some common types of errors that researchers encounter?
A: Immediate crashes, hangs in communication libraries, and speed regressions are some common types of errors that researchers encounter.

Q: How does unified telemetry help in reducing downtime?
A: Unified telemetry helps in identifying recurring issues, detecting patterns, and taking proactive measures rather than purely reactive ones. It enables researchers to leverage infrastructure data to improve debugging, and the operations team to use application insights to improve system automations and reduce hardware downtime.
