Epoch AI has released FrontierMath, a new evaluation benchmark designed to probe the mathematical reasoning limits of the most capable large language models. The results are sobering: across 22 frontier models including GPT-4o, Claude 3.5 Sonnet, and Gemini Ultra, average performance on the hardest problem sets sits below 3%.

Models tested: 22 (including GPT-4o, Claude 3.5, Gemini)
Average score (hard set): below 3% across all frontier models
Problem domains: 7+, from number theory to topology
Verification: human, with all answers verified by mathematicians

What Makes FrontierMath Different

Unlike existing math benchmarks such as MATH or AIME, which models have largely saturated through training-data memorisation, FrontierMath comprises novel problems designed by professional mathematicians specifically to resist pattern-matching. Problems range from number theory to algebraic topology and require multi-step reasoning chains that current transformer architectures appear ill-equipped to execute reliably.

You either get it right or you do not. There is no partial credit for plausible-sounding wrong reasoning.

Epoch AI research notes
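
In practice, this all-or-nothing grading reduces to exact answer matching against a verified reference. The sketch below illustrates what such a binary scorer might look like, assuming each problem resolves to a single numeric answer; the function names, tolerance, and harness are hypothetical and are not Epoch AI's actual evaluation code.

```python
import math

def grade_submission(model_answer: float, reference_answer: float,
                     rel_tol: float = 1e-9) -> bool:
    """Binary grading: an answer is either right or wrong, with no partial credit.

    Illustrative only. Assumes the problem's answer is a single numeric value
    checked by exact (tolerance-bounded) comparison.
    """
    return math.isclose(model_answer, reference_answer, rel_tol=rel_tol)

def benchmark_score(results: list[tuple[float, float]]) -> float:
    """Fraction of problems answered correctly across the whole set."""
    graded = [grade_submission(got, expected) for got, expected in results]
    return sum(graded) / len(graded) if graded else 0.0

# Hypothetical usage: (model answer, reference answer) pairs for three problems.
print(benchmark_score([(42.0, 42.0), (3.14, 2.71), (7.0, 7.0)]))  # ~0.67
```

A plausible-sounding derivation that lands on the wrong final value scores zero under this scheme, which is what distinguishes FrontierMath-style evaluation from benchmarks that reward step-by-step partial credit.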

Reigniting the Reasoning Debate

The benchmark's publication has reignited debate about what 'reasoning' actually means in the context of large language models. Proponents of the 'stochastic parrot' critique argue that the poor performance validates their view that these models are sophisticated interpolators, not genuine reasoners. Defenders of scaling contend that the benchmark is simply beyond the current capability frontier.

For enterprises deploying AI in scientific and engineering workflows, the implications are practical. Current models are useful for code generation, literature synthesis, and pattern analysis — but for tasks requiring formal mathematical proof or novel derivation, they remain unreliable. Epoch AI plans to update the benchmark annually as model capabilities evolve.