Time-to-First-Token Matters for Real-Time Use Cases
Many of the most exciting applications of large language models (LLMs), such as interactive speech bots, coding co-pilots, and search, need to begin responding to user queries quickly to deliver positive user experiences. The time it takes for an LLM to ingest a user prompt (and context, which can be sizable) and begin outputting a response is called time to first token (TTFT).
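TTFT is straightforward to measure against any streaming model endpoint: start a timer when the request is issued and stop it when the first token arrives. The sketch below uses a plain Python iterator as a stand-in for a streaming LLM API; `slow_stream` and its delays are hypothetical placeholders for real prefill and decode latencies.

```python
import time

def measure_ttft(token_stream):
    """Return (ttft_seconds, tokens) for an iterable that yields tokens.

    TTFT is the elapsed time from the call until the first token
    arrives; the iterator stands in for any streaming LLM API.
    """
    start = time.perf_counter()
    tokens = []
    ttft = None
    for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token observed
        tokens.append(tok)
    return ttft, tokens

def slow_stream():
    # Hypothetical model: a prefill delay, then tokens one at a time.
    time.sleep(0.05)           # simulated prefill phase
    for tok in ["Hello", ",", " world"]:
        yield tok
        time.sleep(0.01)       # simulated decode steps

ttft, toks = measure_ttft(slow_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, tokens: {len(toks)}")
```

The same pattern applies to real streaming APIs: only the time to the first chunk counts toward TTFT; everything after that is decode throughput.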
NVIDIA GH200 NVL32 Supercharges TTFT for Long Context Inference
To generate the first new token in response to an inference request, the input tokens must be processed by the LLM. This phase of inference, known as prefill, often has a large number of tokens and thus benefits from increased aggregate compute performance. It can be accelerated by splitting the calculations across multiple GPUs using parallelism techniques, such as tensor parallelism.
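The core idea behind tensor parallelism can be shown with a toy example: shard a weight matrix column-wise across devices, compute partial results independently, then combine them. This is a minimal NumPy sketch in which "GPUs" are just array slices; in a real system each shard lives on its own GPU and the concatenation is a collective over NVLink.

```python
import numpy as np

def column_parallel_linear(x, W, num_shards):
    """Toy tensor parallelism: shard W column-wise across `num_shards`
    simulated GPUs, compute partial outputs, then concatenate.

    On real hardware each shard's matmul runs concurrently on its own
    GPU and the concat is an all-gather; here everything is sequential.
    """
    shards = np.array_split(W, num_shards, axis=1)  # one shard per "GPU"
    partials = [x @ Ws for Ws in shards]            # independent partial matmuls
    return np.concatenate(partials, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))     # 4 prompt tokens, hidden size 8
W = rng.standard_normal((8, 16))    # weight matrix to be sharded
y_parallel = column_parallel_linear(x, W, num_shards=4)
assert np.allclose(y_parallel, x @ W)  # matches the single-device result
```

Because the output columns are independent, the sharded computation is mathematically identical to the unsharded matmul, which is why splitting prefill across GPUs speeds it up without changing the result.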
Llama 3.1 70B
A single GH200 NVL32 system achieves a TTFT of just 472 milliseconds when running Llama 3.1 70B with an input sequence length of 32,768 tokens. In practical terms, this means that Llama 3.1 70B can begin outputting a summary of a 90-page document, or coding suggestions on thousands of lines of code, in less than half a second.
Llama 3.1 405B
Llama 3.1 405B requires substantially more compute to generate the first token of a response, as the model has nearly 6X the parameter count of Llama 3.1 70B. A GH200 NVL32 system can achieve a TTFT of about 1.6 seconds with a 32,768-token input. And with a 122,880-token input, roughly the size of a small codebase, GH200 NVL32 can begin responding in just 7.5 seconds.
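The rough scaling of prefill work with model size can be checked with a common rule of thumb: a forward pass costs on the order of 2 FLOPs per parameter per token (from the matmul multiply-adds, ignoring attention's quadratic term). This back-of-the-envelope sketch is an estimate, not a measured figure:

```python
def prefill_flops(params, tokens):
    """Rule-of-thumb prefill cost: ~2 FLOPs per parameter per token.

    Ignores the attention term, which grows quadratically with
    sequence length, so this is a lower-bound estimate.
    """
    return 2 * params * tokens

f70 = prefill_flops(70e9, 32_768)    # Llama 3.1 70B, 32,768-token prompt
f405 = prefill_flops(405e9, 32_768)  # Llama 3.1 405B, same prompt
ratio = f405 / f70
print(f"70B:  {f70:.2e} FLOPs")
print(f"405B: {f405:.2e} FLOPs ({ratio:.1f}x)")
```

At a fixed prompt length the ratio reduces to the parameter ratio, 405/70 ≈ 5.8x, which matches the "nearly 6X" gap in TTFT-relevant compute between the two models.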
Inference Continues to be a Hotbed of Invention
The pace of inference innovation across serving techniques, runtime optimizations, kernels, and more has been extraordinary. Advancements like in-flight batching, speculative decoding, FlashAttention, key-value caching, and more have been developed by both industry and academia. Collectively, these innovations are enabling more capable models and systems to be deployed efficiently and more cost-effectively in production, making powerful AI capabilities more accessible to the entire NVIDIA ecosystem.
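Of the optimizations listed above, key-value caching is the simplest to illustrate: the keys and values of already-processed tokens are stored so each decode step only computes projections for the newest token, instead of re-running attention inputs for the whole sequence. This is a toy single-head NumPy sketch, not any library's actual implementation:

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

class KVCache:
    """Toy key-value cache: keep K/V for past tokens so each decode
    step appends one new entry rather than recomputing the past."""
    def __init__(self, dim):
        self.K = np.empty((0, dim))
        self.V = np.empty((0, dim))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        return self.K, self.V

rng = np.random.default_rng(1)
cache = KVCache(dim=4)
for step in range(3):                    # three decode steps
    k = rng.standard_normal((1, 4))      # newest token's key
    v = rng.standard_normal((1, 4))      # newest token's value
    q = rng.standard_normal(4)           # newest token's query
    K, V = cache.append(k, v)
    out = attend(q, K, V)                # attends over all cached tokens
print(K.shape)  # → (3, 4): keys for all three tokens are cached
```

Without the cache, every decode step would recompute keys and values for the entire sequence so far, which is exactly the redundant work this optimization removes.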
Next Up: Accelerating Agentic Workflows
Agentic workflows perform tree search, self-reflection, and iterative inferences to reason and produce answers to complex queries. This means the number of inferences per prompt grows by orders of magnitude. With each successive inference, the aggregate response must be processed by the next agent as new context, so fast TTFT becomes even more important as workflows scale.
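The context-growth effect described above can be sketched in a few lines. Here `run_agent_chain`, the agent names, and the stub model are all hypothetical; the point is that each step's output is folded into the next step's prompt, so prefill work compounds across the chain:

```python
def run_agent_chain(prompt, agents, model):
    """Toy agentic workflow: each agent's output is appended to the
    context seen by the next agent, so the context (and the prefill
    work per inference) grows with every step.

    `model` is any callable mapping a prompt string to a response.
    """
    context = prompt
    inference_calls = 0
    for agent in agents:
        response = model(f"[{agent}] {context}")
        inference_calls += 1
        context = context + "\n" + response  # aggregate becomes new context
    return context, inference_calls

# Stub model: reports how much context it had to prefill.
stub = lambda p: f"step({len(p)} chars of context)"
out, calls = run_agent_chain(
    "Summarize the design doc.", ["plan", "draft", "review"], stub
)
print(calls, len(out))
```

Even in this three-agent toy, every inference sees a longer prompt than the last, which is why per-step TTFT multiplies into end-to-end latency as chains deepen.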
NVIDIA Blackwell GB200 NVL72 Powers a New Era of Computing
Looking ahead, as model sizes continue to grow rapidly, models support even longer context lengths, and agentic workflows become more popular, the amount of delivered compute performance required for fast inference will continue to rise.
Conclusion
In this article, we have shown how NVIDIA GH200 NVL32 can achieve the fastest published TTFT for the Llama 3.1 models, even at very long contexts. We have also introduced the NVIDIA Blackwell GB200 NVL72, which delivers the next giant leap for generative AI and accelerated computing.
FAQs
Q: What is Time-to-First-Token (TTFT)?
A: TTFT is the time it takes for an LLM to ingest a user prompt (and context) and begin outputting a response.
Q: What is the significance of fast TTFT in real-time use cases?
A: Fast TTFT is crucial for real-time use cases, as it enables a seamless and interactive user experience.
Q: What is the NVIDIA GH200 NVL32 system?
A: The NVIDIA GH200 NVL32 system is a rack-scale solution that connects 32 NVIDIA GH200 Grace Hopper Superchips using the NVLink Switch system, providing outstanding TTFT for long-context inference.
Q: What is the NVIDIA Blackwell GB200 NVL72?
A: The NVIDIA Blackwell GB200 NVL72 is a next-generation rack-scale system that connects 72 NVIDIA Blackwell GPUs and 36 Grace CPUs in a single NVLink domain, with each GPU delivering up to 20 PFLOPS of FP4 AI compute and 1,800 GB/s of GPU-to-GPU NVLink bandwidth.

