China's LineShine all-CPU exaflops-class supercomputer overtakes US systems in TOP500 ranking, advancing indigenous compute capability.
It has been nine years since a Chinese supercomputer topped the High Performance Linpack (HPL) rankings, but China did break through the exascale barrier at 64-bit precision before the United States did, and on two different systems. While China did not publicly trumpet these achievements, enough information leaked to US experts to establish the record.
The Sunway OceanLight machine, installed at NSC Qingdao and first discussed in February 2021, came first. Built on the Sunway SW26010-Pro CPU, the full system had 41.93 million cores, delivered 1.22 exaflops on HPL, and is rumored to have been operational by March 2021.
The Tianhe-3 supercomputer followed, based on a hybrid Phytium 2000 Arm processor paired with the Matrix 3000 DSP co-processor. With a peak theoretical performance of 2.05 exaflops, it delivered approximately 1.57 exaflops on HPL. Initially deployed at NSC Guangzhou at lower performance levels in October 2021, the system reached these figures when fully fleshed out in December 2023. An earlier prototype using Matrix 2000+ DSP coprocessors achieved 1.3 exaflops on HPL against 1.7 exaflops peak in late 2021.
Both systems surpassed the United States' Oak Ridge "Frontier" supercomputer, a hybrid system whose nodes each pair one AMD "Trento" Epyc CPU with four AMD "Aldebaran" MI250X GPU accelerators. Frontier had approximately 8.7 million cores and achieved 1.19 exaflops on HPL against 1.68 exaflops peak. It entered production in May 2022, more than a year after OceanLight is rumored to have come online.
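The HPL and peak figures quoted above can be turned into computational efficiency (HPL result divided by theoretical peak), which makes the systems easier to compare. The sketch below is only illustrative arithmetic on the numbers in the text. The dictionary layout and function name are hypothetical, and OceanLight is left out because no peak figure is given for it here.

```python
# HPL result and theoretical peak, in exaflops, as quoted in the article.
systems = {
    "Tianhe-3 (Matrix 3000)": (1.57, 2.05),
    "Tianhe-3 prototype (Matrix 2000+)": (1.30, 1.70),
    "Frontier": (1.19, 1.68),
}


def hpl_efficiency(hpl_eflops: float, peak_eflops: float) -> float:
    """Return the fraction of theoretical peak achieved on HPL."""
    return hpl_eflops / peak_eflops


efficiencies = {
    name: hpl_efficiency(hpl, peak) for name, (hpl, peak) in systems.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.1%} of peak")
```

On these figures, both Tianhe-3 configurations land at roughly 76 to 77 percent of peak and Frontier at roughly 71 percent.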
China accomplished this using older process technologies, operating at higher temperatures, consuming significant space, and presumably at substantial cost. By necessity, these systems relied on indigenous components from Semiconductor Manufacturing International Corp (SMIC), as US export controls prevented access to Nvidia and AMD GPUs. This constraint reflects China's strategic commitment to computing independence.
The new top-ranked machine on the Top500 supercomputer list, "LineShine," installed at NSC Shenzhen, exemplifies how technological advancement over five years has enabled systems that exceed their predecessors in both scale and efficiency. LineShine is based on an Armv9.2 CPU core featuring SVE2 vector units and SME matrix math units alongside integer processing—architecturally similar to Intel's P-core Xeon processors, which combine integer processing with AVX vector and AMX matrix units.
The LX2 chip, designed by NSC Shenzhen in conjunction with Huawei's HiSilicon division, contains 304 active cores per socket—with additional cores likely disabled for yield optimization. The LineShine system uses a proprietary LingQi LQLink interconnect, likely based on InfiniBand technology or a specialized Ethernet variant.
The LX2 CPU delivers enough FP64 performance through its SVE2 and SME units that only 13.79 million cores are required to reach 2.74 exaflops of peak theoretical performance. That is about 32.9 percent of OceanLight's 41.93 million cores (roughly 67.1 percent fewer), while delivering 46.7 percent more performance. On the HPL test, LineShine achieves just under 2.2 exaflops, making it 21.5 percent more powerful than the previous leader, Lawrence Livermore's "El Capitan" supercomputer, which uses AMD MI300A compute engines.
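A few derived quantities follow directly from the LineShine figures above. The sketch below is back-of-the-envelope arithmetic, not data from NSC Shenzhen. It treats the HPL result as exactly 2.2 exaflops even though the text says "just under", so the efficiency and the implied El Capitan figure are slight upper bounds. The socket count assumes every socket exposes exactly 304 cores.

```python
# Figures quoted in the article.
LINESHINE_CORES = 13.79e6        # total cores
LINESHINE_PEAK_FLOPS = 2.74e18   # FP64 theoretical peak
LINESHINE_HPL_FLOPS = 2.2e18     # "just under" 2.2 EF, so an upper bound
CORES_PER_SOCKET = 304           # active cores per LX2 socket
HPL_LEAD_OVER_EL_CAPITAN = 0.215 # 21.5 percent

# Peak FP64 throughput attributable to one core, in gigaflops (about 198.7).
per_core_gflops = LINESHINE_PEAK_FLOPS / LINESHINE_CORES / 1e9

# Approximate socket count, assuming 304 cores in every socket (about 45,362).
sockets = LINESHINE_CORES / CORES_PER_SOCKET

# Fraction of peak achieved on HPL (about 0.80).
hpl_efficiency = LINESHINE_HPL_FLOPS / LINESHINE_PEAK_FLOPS

# El Capitan HPL result implied by the 21.5 percent lead, in exaflops.
implied_el_capitan_eflops = (
    LINESHINE_HPL_FLOPS / (1 + HPL_LEAD_OVER_EL_CAPITAN) / 1e18
)

print(f"Peak per core: {per_core_gflops:.1f} GFLOPS")
print(f"Sockets: {sockets:,.0f}")
print(f"HPL efficiency: {hpl_efficiency:.1%}")
print(f"Implied El Capitan HPL: {implied_el_capitan_eflops:.2f} EF")
```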
Details about LineShine emerged from an NSC Shenzhen paper titled "Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials" (April 17) and from presentations at the second annual International Forum for HPC and AI Co-Driven Innovation (HACI 2026), held in Shenzhen from May 22 to May 25. While the LineShine presentation remained confidential, slides were shared publicly by Torsten Hoefler, chief architect for AI and machine learning at CSCS in Switzerland and professor at ETH Zurich, and by Tadashi Ogawa, technical sales manager for Panasas in Japan.
The same paper referenced the China New-generation Intelligent Supercomputer (CNIS), another exascale-class system based on nodes with undisclosed 64-bit CPUs and eight GPUs. The CNIS system comprises 5,632 nodes. Each node's host processor runs at 2.4 GHz with 64 cores in a NUMA architecture, connected to 8-channel DDR5-6400 memory and PCIe Gen5 interfaces, delivering 64 GB/sec of host-to-device bandwidth. Each GPGPU provides 32.7 TFLOPS (FP64), 65.5 TFLOPS (FP32), and 470 TFLOPS (FP16) of peak performance, with 64 GB of HBM offering 1.8 TB/sec of bandwidth, 320 SIMD units, 768 KB of registers, 64 KB of LDS, and 8 MB of L2 cache. Nodes connect through a proprietary InfiniBand-like RDMA network with a three-layer Clos dual-plane topology providing 4×400 Gb/sec per node.
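The per-GPU and per-node figures above are enough to estimate the aggregate peak of CNIS and see why it counts as exascale-class. The sketch below multiplies them out. It counts only the GPUs (host CPU flops are ignored, since the CPU's floating point rate is not disclosed) and assumes all 5,632 nodes are fully populated. The variable names are illustrative.

```python
# CNIS figures quoted in the article.
NODES = 5632
GPUS_PER_NODE = 8
GPU_PEAK_TFLOPS = {"FP64": 32.7, "FP32": 65.5, "FP16": 470.0}
LINKS_PER_NODE = 4
LINK_GBITS_PER_SEC = 400

total_gpus = NODES * GPUS_PER_NODE  # 45,056 GPUs

# Aggregate GPU-only peak per precision, converted from teraflops to exaflops.
system_peak_eflops = {
    precision: total_gpus * tflops / 1e6
    for precision, tflops in GPU_PEAK_TFLOPS.items()
}

# Network injection bandwidth per node, converted from gigabits to gigabytes.
node_injection_gbytes_per_sec = LINKS_PER_NODE * LINK_GBITS_PER_SEC / 8

for precision, eflops in system_peak_eflops.items():
    print(f"{precision}: {eflops:.2f} EF peak (GPUs only)")
print(f"Injection bandwidth: {node_injection_gbytes_per_sec:.0f} GB/sec per node")
```

Under these assumptions the FP64 peak comes to roughly 1.47 exaflops, and the four 400 Gb/sec links give each node about 200 GB/sec of injection bandwidth.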
The LX2 architecture employs two interlinked chiplets, each containing 48 core blocks with four cores per block, yielding 192 raw cores per chiplet and 384 per socket. The 304 cores exposed per socket reflect an active-core fraction of approximately 79.2 percent, consistent with SMIC foundry capabilities. The chiplets likely use SMIC's 7-nanometer N+3 process and operate at 1.55 GHz, well below SMIC's 3 GHz capability. This conservative clock speed balances memory and core performance and avoids the steep, superlinear rise in power consumption that comes with higher clocks. At 1.55 GHz, the LX2 complex consumes 690 watts, representing a deliberate thermal optimization that NSC Shenzhen compensates for through massive scale.
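The core counts, clock speed, and power figure above can be combined with the system-level numbers quoted earlier in the article (13.79 million cores, 2.74 exaflops peak) to sketch what a single LX2 socket delivers. This is derived arithmetic under two assumptions: that the 690 watts applies to one two-chiplet socket, and that peak performance is spread evenly across cores. None of the derived values are published specifications.

```python
# Figures quoted in the article.
CORE_BLOCKS_PER_CHIPLET = 48
CORES_PER_BLOCK = 4
CHIPLETS_PER_SOCKET = 2
ACTIVE_CORES_PER_SOCKET = 304
CLOCK_HZ = 1.55e9
SOCKET_WATTS = 690.0             # assumed to be per socket
SYSTEM_CORES = 13.79e6
SYSTEM_PEAK_FLOPS = 2.74e18

# Raw cores on the package and the fraction left enabled.
raw_cores_per_socket = (
    CORE_BLOCKS_PER_CHIPLET * CORES_PER_BLOCK * CHIPLETS_PER_SOCKET
)
active_fraction = ACTIVE_CORES_PER_SOCKET / raw_cores_per_socket

# Peak FP64 operations per core per clock cycle implied by the system totals.
flops_per_core = SYSTEM_PEAK_FLOPS / SYSTEM_CORES
flops_per_cycle_per_core = flops_per_core / CLOCK_HZ

# Socket-level peak and peak efficiency (CPU package power only).
socket_peak_tflops = flops_per_core * ACTIVE_CORES_PER_SOCKET / 1e12
socket_gflops_per_watt = socket_peak_tflops * 1e3 / SOCKET_WATTS

print(f"Raw cores per socket: {raw_cores_per_socket}")
print(f"Active fraction: {active_fraction:.1%}")
print(f"FP64 flops per cycle per core: {flops_per_cycle_per_core:.1f}")
print(f"Socket peak: {socket_peak_tflops:.1f} TFLOPS")
print(f"Peak efficiency: {socket_gflops_per_watt:.1f} GFLOPS per watt")
```

The implied figure of roughly 128 FP64 operations per cycle per core suggests that most of the peak comes from the SME matrix units rather than the SVE2 vector units, although the article does not break this down.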
The LX2 includes eight HBM stacks per chiplet (sixteen total per socket), each allocated to a block of 24 cores. This configuration yields 32 GB of HBM memory and 4 TB/sec bandwidth per chiplet, totaling 64 GB per socket at 8 TB/sec—likely a variant of HBM2E memory. NSC Shenzhen additionally employs 3D stacking with custom DRAM logic wafers, potentially utilizing LPDDR5X memory from ChangXin Memory Technologies.
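The memory figures above imply fairly modest per-stack and per-core numbers, which is part of why an HBM2E-class part seems plausible. The sketch below divides the quoted chiplet totals across stacks and cores. It assumes capacity and bandwidth are spread evenly, uses decimal units (1 TB/sec = 1,000 GB/sec), and takes the 192 raw and 152 active cores per chiplet from the socket figures given earlier.

```python
# Figures quoted in the article.
STACKS_PER_CHIPLET = 8
CHIPLETS_PER_SOCKET = 2
CHIPLET_CAPACITY_GB = 32
CHIPLET_BANDWIDTH_GB_PER_SEC = 4000   # 4 TB/sec
RAW_CORES_PER_CHIPLET = 192
ACTIVE_CORES_PER_SOCKET = 304

# Per-stack capacity and bandwidth, assuming an even split.
gb_per_stack = CHIPLET_CAPACITY_GB / STACKS_PER_CHIPLET
gb_per_sec_per_stack = CHIPLET_BANDWIDTH_GB_PER_SEC / STACKS_PER_CHIPLET

# Raw cores served by each stack (matches the 24-core blocks in the text).
raw_cores_per_stack = RAW_CORES_PER_CHIPLET // STACKS_PER_CHIPLET

# Socket totals and what each active core gets on average.
socket_capacity_gb = CHIPLET_CAPACITY_GB * CHIPLETS_PER_SOCKET
socket_bandwidth_gb_per_sec = CHIPLET_BANDWIDTH_GB_PER_SEC * CHIPLETS_PER_SOCKET
mb_per_active_core = socket_capacity_gb * 1000 / ACTIVE_CORES_PER_SOCKET
gb_per_sec_per_active_core = (
    socket_bandwidth_gb_per_sec / ACTIVE_CORES_PER_SOCKET
)

print(f"Per stack: {gb_per_stack:.0f} GB at {gb_per_sec_per_stack:.0f} GB/sec")
print(f"Raw cores per stack: {raw_cores_per_stack}")
print(f"Per active core: {mb_per_active_core:.0f} MB, "
      f"{gb_per_sec_per_active_core:.1f} GB/sec")
```

At about 4 GB and 500 GB/sec per stack, and only around 210 MB of HBM per active core, applications on LineShine would need to keep their per-core working sets small or lean on the additional stacked DRAM the article mentions.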