Chips & Hardware · Report
Roughly half of the failures that interrupted Meta's Llama 3 training run on a 16,384-GPU H100 cluster were traced to faulty H100 GPUs or their HBM3 memory; the cluster averaged one failure every three hours.
The report exposes reliability gaps in Nvidia's H100 hardware at scale and raises serious concerns about hardware durability for large-scale training infrastructure.
Trade press · Slicast · July 27, 2024 · Global · Source: tomshardware.com