AI model ran for 19 days on a single MirrorCode benchmark task at a cost of $2,600, highlighting inference economics.
Epoch AI and METR have developed MirrorCode, a new coding benchmark that tests whether AI models can reimplement complete programs from scratch without access to the original source code. The benchmark includes 25 target programs spanning Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each AI-generated solution must exactly reproduce the output of the original program, including on hidden end-to-end tests that the model encounters only during evaluation.
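To make the "exactly reproduce the output" criterion concrete, here is a minimal sketch of an output-equivalence check in Python. It is not MirrorCode's actual harness: the helper names (`run`, `matches_reference`), the toy "programs" (one-line Python scripts that uppercase their input), and the test inputs are all hypothetical stand-ins chosen for illustration.

```python
import subprocess
import sys


def run(cmd, stdin_text):
    """Run a command with the given stdin and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def matches_reference(reference_cmd, candidate_cmd, test_inputs):
    """True only if the candidate matches the reference exactly on every input."""
    return all(
        run(reference_cmd, text) == run(candidate_cmd, text)
        for text in test_inputs
    )


# Hypothetical stand-ins: an "original" program and two reimplementations.
REFERENCE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
GOOD = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]
# This variant appends an extra trailing newline, so it is not an exact match.
BAD = [sys.executable, "-c",
       "import sys; print(sys.stdin.read().upper())"]

# Hypothetical hidden test inputs, unseen while the candidate was written.
HIDDEN_TESTS = ["hello\n", "MiXed Case", ""]

if __name__ == "__main__":
    print("good candidate passes:", matches_reference(REFERENCE, GOOD, HIDDEN_TESTS))
    print("bad candidate passes:", matches_reference(REFERENCE, BAD, HIDDEN_TESTS))
```

The point of the sketch is that exact matching is unforgiving: a single stray newline is enough to fail, which is why a partial reimplementation can pass most tests and still not count as solved.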
A distinctive feature of MirrorCode is its generous inference budget. While many software engineering benchmarks cap costs at $1 to $10 per task, MirrorCode allows substantially higher expenditures. One of the largest tasks cost $2,600 for a single run, during which the AI worked continuously for 19 days with no human intervention.
Claude Opus 4.7 delivered the standout performance, reimplementing gotree, a bioinformatics toolkit comprising approximately 16,000 lines of Go code and more than 40 commands. A human engineer working without AI assistance would require 2 to 17 weeks to complete the same work, according to the researchers. Opus 4.7 finished in 14 hours at a cost of $251.
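The figures reported so far allow a rough back-of-the-envelope comparison. The sketch below assumes a 40-hour human work week and treats the 19-day run as 24 hours a day of continuous work; both are assumptions made here for illustration, not numbers from the researchers, and the function names are hypothetical.

```python
HOURS_PER_WORK_WEEK = 40  # assumption: standard full-time week
HOURS_PER_DAY = 24        # assumption: the AI run was continuous


def cost_per_hour(total_cost_usd, hours):
    """Average spend per wall-clock hour of a run."""
    return total_cost_usd / hours


def speedup_range(human_weeks_low, human_weeks_high, ai_hours):
    """Ratio of estimated human working hours to AI wall-clock hours."""
    low = human_weeks_low * HOURS_PER_WORK_WEEK / ai_hours
    high = human_weeks_high * HOURS_PER_WORK_WEEK / ai_hours
    return low, high


# Reported figures: a $2,600 run lasting 19 days, and gotree at $251 in 14 hours.
long_run_rate = cost_per_hour(2600, 19 * HOURS_PER_DAY)
gotree_rate = cost_per_hour(251, 14)
gotree_speedup = speedup_range(2, 17, 14)

if __name__ == "__main__":
    print(f"19-day run: ${long_run_rate:.2f} per hour")
    print(f"gotree run: ${gotree_rate:.2f} per hour")
    print(f"gotree speedup vs. human estimate: "
          f"{gotree_speedup[0]:.1f}x to {gotree_speedup[1]:.1f}x")
```

Under these assumptions the 19-day run averages under $6 per hour, the gotree run about $18 per hour, and the gotree reimplementation comes out roughly 6 to 49 times faster than the human estimate. The comparison mixes AI wall-clock time with human working hours, so it should be read as indicative only.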
Overall, Claude Opus 4.7 achieved a 56 percent solve rate, followed by GPT-5.5 at 44 percent and Gemini 3.1 Pro Preview at 32 percent. Even when models fail to fully reimplement a program, they typically pass 90 percent or more of the tests. The benchmark reveals clear difficulty stratification: small programs like uuid and parseqsv are reliably reimplemented by all tested models, while the largest tasks remain unsolved across the board.
Progress is accelerating. Leading models from a year ago would have achieved only about 30 percent and been limited to simpler programs such as a calendar utility. Cost trends, however, do not follow a consistent pattern: GPT-5.5 costs three times as much as GPT-5 for identical tasks, while Claude Opus 4.7 costs about one-third as much as Claude Opus 4.1.
Epoch AI has open-sourced the scaffold and 22 of the 25 target programs, covering 132 task instances across six programming languages, with three programs retained for testing purposes. The researchers acknowledge an important caveat: since MirrorCode uses open-source programs as targets, models may have encountered the original code during training. Their initial testing suggests that "the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance."