· 01:08
Today we’re diving into how Tesla keeps its massive Dojo supercomputers running smoothly when a single glitch can ruin a “weeks-long AI training run.” Each Dojo wafer-scale processor, called a “Training Tile,” packs up to 8,850 custom RISC-V cores, and a full-scale Dojo cluster spans millions of cores, all churning through data at breakneck speed. To catch silent data corruptions (hardware faults that produce wrong results without raising any error) without taking machines offline, Tesla created the Stress tool. It sends random instruction payloads between cores and uses XOR checks to boost fault detection “by a factor of 10.” Tests run in the background, so the cluster keeps working while only the cores that fail are quietly disabled. Stress has already uncovered rare design flaws and low-level software bugs, which engineers swiftly patched. Now fully integrated into Dojo clusters, Stress keeps Tesla’s defect rates on par with Google’s and Meta’s, and down the road it should help the company study hardware aging and support pre-silicon validation.
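For a rough feel of why an XOR check can catch faults that a plain result comparison misses, here is a toy Python sketch. It is not Tesla's implementation: the summary does not say how Stress builds its payloads or applies XOR, so the payload format, the function names (`make_payload`, `run_payload`, `find_bad_cores`), and the fault model (a single bit flip in one intermediate value that never reaches the final result) are all assumptions made for illustration.

```python
import random

MASK = 0xFFFFFFFF  # keep every value within 32 bits


def make_payload(rng, length=64):
    """Build a random list of (op, operand) pairs (hypothetical payload format)."""
    ops = ("add", "mul", "xor", "rot")
    return [(rng.choice(ops), rng.getrandbits(32)) for _ in range(length)]


def run_payload(payload, faulty_step=None):
    """Run the payload on a simulated core.

    Returns (final_value, xor_checksum). If faulty_step is set, the value
    observed at that step has one bit flipped, but the accumulator itself is
    untouched, so the final value alone cannot reveal the glitch.
    """
    acc = 1
    checksum = 0
    for i, (op, x) in enumerate(payload):
        if op == "add":
            acc = (acc + x) & MASK
        elif op == "mul":
            acc = (acc * (x | 1)) & MASK
        elif op == "xor":
            acc ^= x
        else:  # "rot": rotate left by x mod 32 bits
            r = x % 32
            acc = ((acc << r) | (acc >> (32 - r))) & MASK
        observed = acc
        if i == faulty_step:
            observed ^= 1 << 7  # assumed fault model: transient single-bit flip
        checksum ^= observed  # fold every intermediate value into the check
    return acc, checksum


def find_bad_cores(num_cores, faulty, seed=0):
    """Run one payload on every simulated core and list the cores that disagree.

    faulty maps a core index to the step at which that core glitches.
    """
    rng = random.Random(seed)
    payload = make_payload(rng)
    reference = run_payload(payload)  # result from a fault-free run
    return [
        core
        for core in range(num_cores)
        if run_payload(payload, faulty.get(core)) != reference
    ]
```

In this sketch, `find_bad_cores(8, {3: 10, 6: 40})` flags cores 3 and 6 even though both produce the correct final value, because the flipped bit shows up in the XOR checksum. A real tool would compare results across physical cores rather than against a simulated fault-free run.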