Chips & Hardware · Report

Groq's Language Processing Unit (LPU) integrated into NVIDIA Vera Rubin platform, signaling mainstream acceptance of specialized inference accelerators.

Validates alternative chip architectures for inference beyond GPUs; marks industry shift to compute specialization and heterogeneous systems.

Trade press · Slicast · June 25, 2026 · China · Source: 雷锋网


As AI moved from the training era into the inference era, a single general-purpose architecture began to hit efficiency limits. The narrative of "conquering everything with GPUs alone" became difficult to sustain, and a specialized division of labor gradually became the consensus in the chip industry.

Google is advancing training-inference separation in its new-generation TPUs; Anthropic is betting on in-memory computing architectures; SambaNova released a "CPU+GPU+RDU" system solution; Cerebras chose to challenge traditional GPU clusters with wafer-scale chips.

With Groq's LPU (Language Processing Unit) being incorporated into NVIDIA's Vera Rubin platform, the LPU—previously viewed as a "niche path"—entered mainstream AI infrastructure for the first time. For the industry, this not only signifies recognition of a new chip type, but marks an inflection point: the inference era is beginning to embrace the logic that different chips handle different tasks.

China's domestic market is feeling the same shift. New solutions built around dataflow architectures, high-bandwidth SRAM, and other inference acceleration approaches keep emerging, and one player after another is stepping forward hoping to tell its own LPU story.

As the trend toward AI chip specialization becomes clearer, is the LPU a passing hot topic or a permanent new role in the inference era? And as the field grows more crowded, the LPU may well address a genuine need, but is an independent LPU company a viable business?

NVIDIA's latest solution combines 25% Groq LPUs with 75% Vera Rubin to handle endless streams of high-value token generation demands.

Behind this approach lies a rewriting of rules in the Agent era: AI applications are no longer one-off Q&A sessions; continuous reasoning workflows are triggering a flood of tokens. Infrastructure competition is escalating from single-chip performance comparisons to system-level efficiency optimization.

The first distinction to become clear was between Prefill and Decode: the former leans more on compute density, the latter depends more on response speed and system throughput.

But the industry quickly discovered that even within Decode, different stages stress different resources: Attention is busy moving and reading a massive KV cache, while much of the token generation work falls on the FFN (feed-forward network) layers.
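
To give a sense of how large that KV cache traffic can be, here is a minimal sketch of the standard size estimate (keys plus values, per layer, per cached token). The model shape and batch size are hypothetical values chosen only for illustration; they do not describe any system named in this article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes of key and value tensors cached for `batch` sequences of `seq_len` tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape (assumption, not from the article), 16-bit cache entries.
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch=16)
print(f"KV cache touched per decode step: {cache / 2**30:.1f} GiB")
```

Under these assumed numbers the attention stage has to read about 16 GiB for every generated token across the batch, which is why it behaves as a memory-movement problem rather than a compute problem.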

Once these differences are recognized, the need for differentiated collaboration becomes more urgent. Different types of chips are entering inference systems, each handling the work it does best.

Against this backdrop, Groq's LPU came back into the market's view as a new component of the Vera Rubin platform, targeting FFN-related workloads in the form of the LPX system.

"Ultra-low-latency inference and other extreme scenarios unsuitable for GPU processing can be handed off to the LPU," explains architecture engineer Xiaofang. "It's like opening a dedicated express lane to serve customers."

In fact, the LPU didn't appear out of nowhere. Groq was founded in 2016, and its core architecture design was also born in the previous AI era. But for a long time afterward, these specialized chips never entered the mainstream market.

According to sources, after NVIDIA first opened its NVLink interconnect ecosystem to partners in early 2025, Groq proactively sought access, hoping to gain support for a protocol originally designed for GPU-to-GPU communication.

Once GPU-LPU collaborative operation was shown to be feasible, the collaboration between the two companies had a practical foundation. Changes in NVIDIA's own strategy then opened up even more room for possibilities.

AI systems architect Xu observes that more new chips designed specifically for Transformer inference will emerge. "The window for achieving leadership with a single chip is narrowing," he says. "But through system-level architecture innovation, NVIDIA's lead could extend from a few months to one or two years."

In other words, for NVIDIA, introducing the LPU isn't about replacing GPUs, but finding roles better suited to specific inference tasks.

Specialization brings new opportunities for the LPU, but converting opportunities into markets is another matter. As more companies crowd into the LPU track, a more practical question is surfacing: what is the real substance of the technological advantages LPUs are expected to deliver?

"Single-chip deterministic latency isn't unique to LPUs; all ASICs can achieve that. What's truly difficult is precise orchestration across multiple chips, racks, and clusters," says one expert. In her view, this is the LPU's deepest moat, and a barrier that smaller domestic companies struggle to overcome.

But Tim, who once led chip software stack design at a major tech company, is more reserved. He believes the value of compiler capability is closely tied to the shape of the models being run.

During the CNN era, model structures were diverse and operators numerous, giving compilers ample room to shine. But as the Transformer became the industry standard, the core operators of large models steadily converged, and many layer structures are now highly repetitive.

Meanwhile, the rise of dynamic architectures like MoE (Mixture of Experts) is eroding the advantages of fully static systems.

"Practically all top-tier models now have MoE structures," Tim says. "The dynamic nature during inference isn't particularly friendly to fully static systems."

He further explains that different requests activate different expert combinations during inference, and this information cannot be known at compile time.
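
The point about compile time can be illustrated with a minimal sketch of top-k expert routing. The router scores below are hypothetical; in a real MoE model they are computed from each token's hidden state at run time, which is exactly why a static schedule cannot know them in advance.

```python
from collections import Counter


def top_k_experts(router_logits, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical router scores for three tokens over four experts (illustration only).
tokens = {
    "token_a": [2.0, 0.1, 1.5, -0.3],
    "token_b": [-1.0, 0.9, 0.2, 1.7],
    "token_c": [1.2, -0.4, 2.2, 0.3],
}

routes = {name: top_k_experts(logits) for name, logits in tokens.items()}
load = Counter(e for experts in routes.values() for e in experts)

print(routes)  # which experts each token activates
print(load)    # per-expert load, known only once the inputs arrive
```

Here two tokens select experts 0 and 2 while the third selects experts 1 and 3, so the per-expert load is uneven and depends entirely on the incoming requests.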

Mark expresses similar views. His non-GPU chip startup has already secured investment from multiple leading US dollar funds.

"To ensure the system runs at its predetermined pace, the compiler can only plan for the worst case," he points out. "The rigid hardware side also needs to retain some redundancy to keep everything synchronized, which offsets some of the theoretical advantages."

The industry has not reached consensus on the LPU's software capabilities. By comparison, its other "trump card," SRAM-based storage, seems easier to quantify. Many practitioners believe this is the LPU's core competitive advantage.

NVIDIA's published data shows a single Groq 3 LPU has 150 TB/s SRAM bandwidth, roughly 45 times that of H100 HBM3. In a 256-LPU LPX cabinet, total bandwidth is further pushed to 40 PB/s (note: 1 PB/s = 1000 TB/s).
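
The figures above can be cross-checked with a few lines of arithmetic. The per-chip bandwidth and cabinet size come from the article; the H100 number is an assumption taken from NVIDIA's public H100 SXM specification (about 3.35 TB/s of HBM3 bandwidth).

```python
LPU_SRAM_TBPS = 150.0    # per-chip SRAM bandwidth quoted in the article
LPUS_PER_CABINET = 256   # LPX cabinet size quoted in the article
H100_HBM3_TBPS = 3.35    # assumption: H100 SXM peak HBM3 bandwidth (public spec)

cabinet_pbps = LPU_SRAM_TBPS * LPUS_PER_CABINET / 1000  # 1 PB/s = 1000 TB/s
ratio_vs_h100 = LPU_SRAM_TBPS / H100_HBM3_TBPS

print(f"Cabinet aggregate: {cabinet_pbps:.1f} PB/s")
print(f"Per-chip ratio vs H100: {ratio_vs_h100:.1f}x")
```

The straight multiplication gives 38.4 PB/s, which is consistent with the rounded 40 PB/s figure, and the per-chip ratio comes out just under 45.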

Beyond high-bandwidth capability, chip industry veteran Yang also sees this path's advantage in circumventing HBM supply chain and advanced packaging constraints.

In current AI chip cost structures, storage's influence continues rising. Epoch AI data shows HBM's share of AI chip component costs has grown from 52% in early 2024 to 63% by end of 2025.

As HBM takes up a growing share of costs, the market is reassessing SRAM's value, though disagreement persists.

Seasoned chip product manager Gu plainly states: "SRAM is actually a massive defect of the LPU." She believes SRAM's greatest trait is speed, but the cost is small capacity and high per-unit expense.

However, IO Capital founder Zhao doesn't entirely agree. He thinks simply comparing per-unit storage prices is not meaningful.

"SRAM offers only hundreds of MB, versus tens or even hundreds of GB for HBM. So even if SRAM's per-unit price is higher, given the capacity gap, HBM's total cost might ultimately be higher."

SRAM has its own capacity anxieties. Xiaodong, a chip compute architecture expert with over a decade of experience, points out that SRAM is directly integrated into the chip and must share the same silicon area with compute units. This means area allocation is always a difficult problem.

"A DRAM storage unit needs only 1 transistor and 1 capacitor, while SRAM needs 6 transistors," he adds. "In the same area, SRAM naturally stores less data."

Public data shows the Groq 3 LPU integrates roughly 500 MB of SRAM, while the TPU 8i has about 384 MB. Cerebras' WSE-3 raised capacity to 44 GB through wafer-scale integration, but at the price of a double penalty in yield and cost.

Whether SRAM counts as cheap or expensive depends on the angle. The more pressing question is: in the inference era, which metrics should measure value?

Mark believes it's tokens. In his view, evaluation metrics are shifting from "system cost" to "token cost."

In past years, the industry got used to discussing "how many cards it takes to deploy a model," so many vendors emphasized completing deployment with fewer GPUs.

He cites examples where a solution can deploy a model on 8 GPUs, yet its inference cost is not necessarily the lowest. After DeepSeek publicly disclosed using 144 cards to build its inference cluster, the industry saw another possibility.
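
The contrast between a small and a large deployment comes down to dividing system cost by token throughput. The sketch below uses purely hypothetical hourly costs and throughput figures; they are placeholders for illustration, not measurements of any 8-GPU or 144-card system.

```python
def cost_per_million_tokens(system_cost_per_hour, tokens_per_second):
    """Serving cost per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return system_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical figures (assumptions): the large cluster costs 18x more per hour
# but is assumed to deliver 50x the token throughput.
small = cost_per_million_tokens(system_cost_per_hour=20.0, tokens_per_second=2_000)
large = cost_per_million_tokens(system_cost_per_hour=360.0, tokens_per_second=100_000)

print(f"small deployment: {small:.2f} per million tokens")
print(f"large deployment: {large:.2f} per million tokens")
```

Whenever throughput grows faster than system cost, the larger system wins on per-token cost even though its total bill is much higher.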

"Though overall system costs increased significantly, larger cluster scale brought higher bandwidth, higher token throughput, and lower per-token cost," Mark analyzes.

Disagreement hasn't vanished, and the LPU's advantages come with real costs. But there is consensus on at least one point: the LPU has secured its entry ticket to the inference system.

What it must answer next is the question the market keeps asking: is this a business that can generate revenue continuously?

Before gaining NVIDIA's backing, Groq had already leveraged its independent end-to-end inference deployment capability to win Saudi Arabia's inference infrastructure project, deploy large compute centers in Europe, and enter Meta's Llama ecosystem.

"Enterprises betting on this track must have target customers," Zhao explains. "Because regardless of software compilation, optimization ultimately targets specific applications."

In other words, the LPU's commercialization challenge isn't merely technical implementation, but whether anyone is willing to pay. And one issue is impossible to ignore: those who need LPUs most often have the strongest ability to develop their own.

Xu observes that large model companies and tech giants have already started moving. "Compared to GPUs, LPUs are much simpler. In one or two years you can build one."

But customers becoming competitors isn't the worst news. "Startups can't survive on LPUs alone; they need to find the 'masses.' NVIDIA is adding a 'Ferrari' on top of already having the 'masses'—that's icing on the cake," Gu plainly states.

Mark points out that such specialization will continue to deepen. "Attention and FFN are highly decoupled, and the communication bandwidth required between them isn't high. Therefore, heterogeneous systems won't bring the massive costs outsiders imagine."

Tim also believes future inference solutions will likely exist in heterogeneous form. "When every optimization brings hundred-million-dollar returns, R&D costs are easily amortized."

According to Zhao's observation, many enterprises are already exploring similar paths, using large-capacity SRAM and distributed storage to complete inference workloads. "It's just called LPU now," he says bluntly.

Xiaofang recalls the DPU's development trajectory. Around 2020, as the DPU concept emerged, numerous startups flooded into the track; years later, many had pivoted. In her view, the LPU might replay a similar script, partly because of the long market cultivation cycle.

Zhao explains that any new computing architecture needs time to settle, just as NVIDIA GPUs took a decade to achieve mass adoption.

But for startups, this is the deadliest risk. As highly specialized ASICs, LPUs naturally depend on current mainstream model architectures. If foundation models change fundamentally, the value of those optimizations might be reassessed.

To this, Mark responds from another angle: "This actually gives startups opportunities, because major companies may not want to bear such high risk."

Meanwhile, Xiaodong is relatively optimistic. He points out that since AlexNet sparked the modern deep learning wave in the CNN era, AI paradigms have evolved continuously over the past decade, yet the underlying logic hasn't fundamentally changed. Future architectures are more likely to be "Transformer-Plus."

Tim offers a similar judgment: "As long as models need to filter, invoke, and combine information from massive knowledge, demand for high bandwidth won't disappear. If chips are designed around this need, then even if the Transformer gets replaced, the chips themselves won't become obsolete."

The market never lacks new chip stories. What truly determines whether an LPU company survives isn't necessarily how advanced its architecture is, but whether it can find customers, scenarios, and ecosystems before the market matures.

After all, the inference era might truly need more and more "Ferraris." But for most startups, what's harder than building a Ferrari is finding someone willing to continuously buy the "masses-plus-Ferrari" combination.
