Intel and AMD release ACE CPU extension, adding AI-optimized instruction set to x86, improving matrix operation efficiency.
Running AI Models on x86 CPUs Is Becoming Easier and Faster
Most discussions about "running AI models" assume a GPU, but not every AI task is well suited to that hardware. Smaller models and latency-sensitive single-user workloads can benefit from running on the CPU, because this avoids the overhead of shuttling data back and forth to the GPU. There are also many scenarios where no GPU is available at all, or where only a low-performance integrated GPU is present. Intel and AMD have recently released a comprehensive specification for the ACE CPU extension, which is intended to make running these AI tasks on x86 processors easier and more energy-efficient.
ACE achieves this by defining a technical standard that reuses the existing AVX10 registers but adds dedicated silicon for matrix multiplication. This brings several benefits, chiefly better power efficiency, simpler development and optimization, and the ability to build on AVX's 512-bit operations. Because ACE works on the same 512-bit data, it does not require ACE-specific inputs, which makes integration with existing designs easier.
Matrix multiplication is the cornerstone of AI workloads: take two tables of numbers and repeatedly multiply pairs of elements and add the products into a running total (multiply-accumulate). Most CPUs have always been able to do this, albeit at limited speed. Even today, these multiply-accumulate loops consume considerable power, even when they use x86's AVX10 multiply-accumulate instructions. That approach is technically a workaround, since AVX was not designed with 2D matrix multiplication in mind.
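To make the multiply-accumulate idea concrete, here is a minimal sketch in plain Python. It is not how any CPU or library implements the operation; the function name `matmul_mac` and the small example matrices are made up for illustration. It only shows that a matrix product is nothing more than a large number of multiply-accumulate steps.

```python
def matmul_mac(a, b):
    """Multiply an m x k matrix by a k x n matrix with explicit
    multiply-accumulate (MAC) steps, and count how many were needed."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")

    c = [[0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate step
                macs += 1
            c[i][j] = acc
    return c, macs


# Hypothetical 2 x 2 inputs, chosen only to keep the arithmetic easy to follow.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, mac_count = matmul_mac(a, b)
print(result, mac_count)
```

An m x k by k x n product needs m * k * n multiply-accumulate steps, so the work grows quickly with matrix size. That is why hardware that performs many of these steps per instruction matters.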
For the same number of input vectors, ACE can perform 16 times as many operations as AVX10. This does not necessarily translate into a 16x speedup, since that depends on the specific implementation, but it is reasonable to expect Intel and AMD to allocate more silicon to this task in future designs to improve performance. In addition, because each ACE instruction does more work than the equivalent AVX10 loop, there is less instruction overhead on the CPU, and RAM bandwidth may also be used more efficiently.
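One way to see where a factor of 16 could come from is to count multiply-accumulate steps. The sketch below is an assumption, not a description of the ACE specification: it assumes a 512-bit register holds 16 32-bit elements, and it assumes the matrix operation behaves like an outer-product accumulate, where every element of one input vector is multiplied with every element of the other. The function names are hypothetical.

```python
LANES = 16  # assumption: 512-bit register / 32-bit elements


def vector_fma(acc, a, b):
    """Element-wise multiply-accumulate: one MAC per lane."""
    macs = 0
    for i in range(len(a)):
        acc[i] += a[i] * b[i]
        macs += 1
    return macs


def outer_product_accumulate(tile, a, b):
    """Outer-product accumulate (assumed model of a matrix instruction):
    every element of a is multiplied with every element of b."""
    macs = 0
    for i in range(len(a)):
        for j in range(len(b)):
            tile[i][j] += a[i] * b[j]
            macs += 1
    return macs


# The same two input vectors feed both styles of operation.
a = [float(i) for i in range(LANES)]
b = [float(i + 1) for i in range(LANES)]

vec_acc = [0.0] * LANES
tile = [[0.0] * LANES for _ in range(LANES)]

vector_macs = vector_fma(vec_acc, a, b)
tile_macs = outer_product_accumulate(tile, a, b)
ratio = tile_macs // vector_macs
print(vector_macs, tile_macs, ratio)
```

Under these assumptions the two input vectors yield 16 multiply-accumulates in the element-wise case and 16 x 16 = 256 in the outer-product case. The count says nothing about wall-clock speed, which depends on how much silicon an implementation dedicates to the operation.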
The benefits go beyond using fewer instructions for the same task. ACE is designed to be implementation-agnostic, meaning ML frameworks such as PyTorch and TensorFlow, along with their underlying libraries, need to maintain only one code path rather than multiple variants that depend on the underlying hardware and its level of AVX support.
ACE natively supports virtually all data types used in ML operations (including but not limited to INT8, INT32, FP8, FP16, FP32, and BF16). It also natively supports the Open Compute Project's MX block-scaling format, which AVX10 does not. Developers can also move some workloads that would otherwise target an NPU back to the CPU when certain tasks need to finish quickly. In these scenarios, not having to account for the differences between NPUs is a major advantage, because ACE provides a consistent target across x86 hardware.
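For readers unfamiliar with block scaling, the following is a simplified sketch of the idea behind the MX format: a block of 32 values shares a single power-of-two scale, and each value is stored as a small integer relative to that scale. This is an illustration only. It does not reproduce the exact bit-level encoding of the OCP specification, and the helper names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math

BLOCK_SIZE = 32  # the MX formats group elements into blocks of 32
INT8_MAX = 127   # simplified symmetric 8-bit element range (assumption)


def quantize_block(values):
    """Return (shared_exponent, elements) for one block.

    The shared scale is 2 ** shared_exponent, chosen as the smallest
    power of two that lets the largest magnitude fit in the element range.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0, [0] * len(values)
    exponent = math.ceil(math.log2(max_abs / INT8_MAX))
    scale = 2.0 ** exponent
    elements = [
        max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in values
    ]
    return exponent, elements


def dequantize_block(exponent, elements):
    """Rebuild approximate values from the shared scale and the elements."""
    scale = 2.0 ** exponent
    return [e * scale for e in elements]


# Hypothetical block of 32 small values, such as a slice of model weights.
block = [0.01 * (i - 16) for i in range(BLOCK_SIZE)]
exponent, elements = quantize_block(block)
restored = dequantize_block(exponent, elements)
max_error = max(abs(x - y) for x, y in zip(block, restored))
print(exponent, max_error)
```

The appeal of the scheme is that each element needs only a few bits, while the shared scale, stored once per block, preserves dynamic range. Native hardware support means the scale can be applied as part of the matrix operation instead of in a separate software pass.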