
NVIDIA's full-stack inference software on Blackwell has reduced token costs by up to 5x in one month by coordinating har

NVIDIA official — first-hand confirmation of roadmap / product.

Official disclosure · Slicast · July 3, 2026 · US · Source: NVIDIA Blog

As organizations move from AI pilots to production AI factories, infrastructure decisions have shifted from peak chip specifications to cost per token: how many useful tokens they can deliver per dollar, per watt and within required latency targets.
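To make the cost-per-token framing concrete, the short Python sketch below derives the three quantities named above (tokens per dollar, tokens per watt and latency compliance) from a handful of deployment measurements. The `ServingProfile` type, its field names and the example numbers are hypothetical and chosen only for illustration; they are not NVIDIA figures or part of any NVIDIA API.

```python
from dataclasses import dataclass


@dataclass
class ServingProfile:
    """Hypothetical measurements for one inference deployment."""
    tokens_per_second: float   # sustained useful output tokens per second
    cost_per_hour_usd: float   # all-in hourly infrastructure cost
    power_watts: float         # average power draw of the deployment
    p95_latency_ms: float      # observed 95th-percentile response latency


def cost_per_million_tokens(p: ServingProfile) -> float:
    """Dollars spent to deliver one million tokens."""
    tokens_per_hour = p.tokens_per_second * 3600
    return p.cost_per_hour_usd / tokens_per_hour * 1_000_000


def tokens_per_joule(p: ServingProfile) -> float:
    """Energy efficiency; one watt is one joule per second."""
    return p.tokens_per_second / p.power_watts


def meets_latency_target(p: ServingProfile, target_ms: float) -> bool:
    """Throughput only counts if the latency target is still met."""
    return p.p95_latency_ms <= target_ms


# Illustrative numbers only (assumed, not measured).
profile = ServingProfile(
    tokens_per_second=10_000,
    cost_per_hour_usd=36.0,
    power_watts=5_000,
    p95_latency_ms=800,
)

print(f"cost per 1M tokens: ${cost_per_million_tokens(profile):.2f}")
print(f"tokens per joule:   {tokens_per_joule(profile):.2f}")
print(f"meets 1000 ms p95:  {meets_latency_target(profile, 1000)}")
```

Under this framing, a deployment with a higher peak chip specification can still lose to one that sustains more tokens per second at the same hourly cost and within the same latency target.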

Codesigned with NVIDIA GPUs, CPUs, networking and systems, and strengthened by a broad open source ecosystem, NVIDIA's full-stack inference software continuously improves hardware performance. On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month.

The nature of AI workloads has fundamentally changed. Traditional web, search and software-as-a-service applications were relatively predictable: a user might load a page, refresh a feed or update a business record. These requests typically followed similar software paths, reading from or writing to a database, and scaled by adding more of the same servers. By contrast, agents can reason, plan, call tools, spin up specialist subagents and manage massive context across multi-turn workflows. They turn a single request into a distributed computing problem that can span hundreds of subagents, thousands of tasks and multiple large language models, running across GPUs, CPUs, DPUs and storage systems. The software stack determines whether that complexity turns into wasted capacity or lower cost per token.

Lower cost per token comes from turning individual optimizations into system-level performance. NVIDIA's inference software stack connects three layers that work together as one system, so that gains from each layer compound rather than remain isolated. Disaggregated serving, large expert parallelism over NVIDIA NVLink interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. Combined, they increase throughput by up to 20x. Capturing that gain in production requires coordination across the full inference stack, from production operations and model runtimes to kernels, communication libraries and hardware access.
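The idea that optimizations compound can be sketched with simple arithmetic. The Python snippet below assumes an idealized case in which each optimization's throughput gain is independent and therefore multiplies with the others, and in which hourly infrastructure cost stays fixed, so cost per token scales as the inverse of throughput. The function names and the per-optimization factors are hypothetical placeholders, not measured values for any of the techniques named above; real gains usually overlap and compound less cleanly.

```python
from math import prod
from typing import Iterable


def combined_speedup(factors: Iterable[float]) -> float:
    """Idealized combined throughput gain: independent factors multiply."""
    return prod(factors)


def relative_cost_per_token(speedup: float) -> float:
    """Cost per token relative to baseline, assuming fixed hourly cost."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Placeholder gains for three unnamed optimizations (assumed values).
placeholder_factors = [2.0, 1.5, 2.0]

total = combined_speedup(placeholder_factors)
print(f"combined speedup: {total:.1f}x")
print(f"relative cost per token: {relative_cost_per_token(total):.3f}")

# The article's 5x throughput figure maps to one-fifth of the original cost.
print(f"5x speedup -> relative cost {relative_cost_per_token(5.0):.2f}")
```

The same inverse relationship is what links the reported 5x performance improvement to token costs of roughly one-fifth of their previous level, provided the underlying infrastructure cost is unchanged.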

This advantage is amplified by the open source ecosystem. Many of today's most widely used open source AI frameworks are built natively on NVIDIA CUDA, meaning new research and software optimizations run with leading performance on NVIDIA GPUs from day zero. PyTorch, launched in 2016 with native CUDA support, has coevolved with NVIDIA's architecture, giving developers access to innovations such as Tensor Cores, Transformer Engine and NVFP4 directly through a familiar framework. When breakthroughs such as DFlash speculative decode, which delivers up to 15x more throughput on existing hardware, or FastVideo, which generates 1080p videos in less than five seconds, land in PyTorch, they run instantly on NVIDIA.

This open source momentum extends to new frontier models. When DeepSeek V4 was released, leading inference frameworks like vLLM and SGLang had day-zero deployment recipes for the NVIDIA Blackwell architecture, making the model accessible across millions of Blackwell GPUs. DeepSeek V4 performance on Blackwell improved by up to 5x within about a month across these frameworks, cutting token costs to roughly one-fifth of previous levels. This is the open source flywheel: more developers optimize CUDA-native inference paths, more production deployments feed back into the ecosystem, and each software improvement increases delivered token output while lowering cost per token over time.
