Qualcomm reveals HBC near-memory AI architecture with AI250 and AI350 accelerators, claiming 6x higher bandwidth-per-watt vs. HBM and 200x capacity vs. on-chip SRAM.
The memory wall remains a major performance limiter for many AI workloads. While high-bandwidth memory (HBM) is commonly employed, it is not always sufficient, since compute capability is growing faster than memory bandwidth. On Wednesday, Qualcomm introduced its high-bandwidth compute (HBC) near-memory architecture, designed to break the memory wall and enable linear performance scaling for certain AI workloads.
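The memory wall can be illustrated with the standard roofline model, in which attainable throughput is the lesser of peak compute and memory bandwidth multiplied by arithmetic intensity. The sketch below is illustrative only: the peak compute, bandwidth, and intensity figures are hypothetical placeholders, not Qualcomm specifications, and the function name is invented for this example.

```python
def attainable_tflops(peak_tflops: float,
                      bandwidth_tb_per_s: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity).

    TB/s multiplied by FLOP/byte gives TFLOP/s, so the units line up.
    """
    return min(peak_tflops, bandwidth_tb_per_s * intensity_flop_per_byte)


# Hypothetical accelerator and workload (assumed values, not vendor data).
PEAK_TFLOPS = 1000.0   # assumed compute peak
INTENSITY = 2.0        # assumed FLOP per byte for a memory-bound workload

for bandwidth in (1.0, 2.0, 4.0, 8.0):  # TB/s, assumed sweep
    tflops = attainable_tflops(PEAK_TFLOPS, bandwidth, INTENSITY)
    print(f"{bandwidth:4.1f} TB/s -> {tflops:7.1f} TFLOP/s attainable")
```

As long as bandwidth times intensity stays below the compute peak, attainable throughput grows in direct proportion to bandwidth, which is the regime where added memory bandwidth translates into linear performance scaling.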
Qualcomm's approach is straightforward: the company disaggregates the AI accelerator from the system-on-chip (SoC) and places it directly beneath the LPDDR DRAM stack. The HBC accelerator connects to the LPDDR stack using through-silicon vias, which the company says delivers high bandwidth and capacity without requiring expensive HBM or advanced packaging. Though Qualcomm does not disclose the actual bandwidth HBC provides, the company claims 6X higher bandwidth-per-watt compared to HBM and over 200X the capacity of on-chip SRAM.
"We have separated the AI accelerator from the XPU and placed the XPU directly beneath a DRAM stack," said Tony Pialis, Executive Vice President and General Manager of Data Center Business at Qualcomm. "This is very important because it gives us the performance advantages of SRAM with the density and capacity of stacked memory. In effect, the congestion associated with HBM is gone. The value to the industry is lower power consumption, less heat, and the elimination of the costly silicon interposer used by HBM solutions. We can also deploy multiple HBC stacks within a single compute device using standard packaging, which delivers a significant performance-per-cost advantage."
Placing DRAM on or adjacent to logic is not new; all major DRAM manufacturers have experimented with near-memory compute architectures, though none have achieved widespread adoption. More recently, GUC, a fabless ASIC design service company, proposed its DRAM-on-Logic (DoL) technology, which places one to four DRAM layers on top of logic to achieve around 5 TB/s of memory bandwidth, offering higher performance than some HBM3E subsystems without requiring expensive advanced packaging or HBM3E stacks.
Since Qualcomm does not disclose actual performance numbers, a direct comparison with GUC's offering remains difficult. A significant caveat regarding HBC is that Qualcomm has not disclosed what the HBC accelerator actually does. In theory, it could be anything: a transformer-specific near-memory engine, a general array of tensor cores, or preprocessing logic for AI inference or training.
Qualcomm's roadmap includes the following milestones. The AI200 accelerator, due later this year, will rely on LPDDR5X and offer 43 TB of RAM per rack. Its successor, the AI250, will use 1st Generation HBC and deliver 18X the bandwidth of the AI200. The AI300 will employ 2nd Generation HBC, providing 54X the bandwidth of the AI200.
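The roadmap multiples also imply a generation-over-generation step. The short sketch below works through that arithmetic, assuming both the 18X and 54X figures are measured against the AI200 baseline; the dictionary keys and function name are made up for illustration, and no absolute bandwidth is implied because Qualcomm has not published one.

```python
# Bandwidth multiples relative to the AI200, as quoted in the roadmap.
# Assumption: both figures use the AI200 as their baseline.
BANDWIDTH_VS_AI200 = {
    "ai200_lpddr5x": 1,
    "gen1_hbc": 18,
    "gen2_hbc": 54,
}


def generational_step(older: str, newer: str) -> float:
    """Return how many times more bandwidth `newer` has than `older`."""
    return BANDWIDTH_VS_AI200[newer] / BANDWIDTH_VS_AI200[older]


step = generational_step("gen1_hbc", "gen2_hbc")
print(f"2nd Gen HBC vs 1st Gen HBC: {step:.1f}x")
```

Under that assumption, 2nd Generation HBC would offer three times the bandwidth of 1st Generation HBC.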