
OpenAI’s latest AI model, o3, has come under fire after an independent evaluation showed it scoring far lower on a key benchmark than the company had claimed, casting doubt on the company’s transparency and the reliability of its public statements.
Epoch AI’s latest analysis found that o3 scored only 10% on the widely used FrontierMath benchmark, substantially lower than OpenAI’s earlier claim that the model could solve more than 25% of the problems. The gap has raised concerns among AI researchers and intensified calls for impartial, third-party evaluations of advanced AI systems.
Benchmark Discrepancy Sparks Industry Debate
Epoch AI disclosed in a post on X that o3 achieved a score of roughly 10% on its test, in sharp contrast to OpenAI’s assertion that the model could handle over 25% of the problems. Announcing the results, Epoch wrote:
“On FrontierMath, our benchmark of highly challenging, original math questions, o4-mini, with high reasoning, sets a new record in our evaluations, with an accuracy of 17% (±2%)! o3 scores 10% (±2%) with high reasoning, behind o4-mini and o3-mini. On GPQA Diamond, a set of PhD-level multiple choice science questions, o3 scores 82% (±2%), just short of Gemini 2.5 Pro’s 84%, while o4-mini scores 80% (±2%). This matches OpenAI’s reported scores of 83% and 81% for o3 and o4-mini. Both outperform OpenAI’s older reasoning models.”
In December 2024, OpenAI unveiled its o3 model in a livestream, highlighting its advanced reasoning abilities. To back up that claim, the company reported that the model excelled on tests like FrontierMath.
The FrontierMath benchmark, created to assess advanced reasoning and problem-solving in AI models, has become a litmus test for large language models pushing toward artificial general intelligence (AGI). OpenAI’s claim that o3 surpassed GPT-4 on this benchmark drew widespread interest, until Epoch AI, which helped build the benchmark, publicly challenged the figures.
However, this disparity does not necessarily indicate any wrongdoing by OpenAI. OpenAI’s December benchmarks included a lower-bound score that is consistent with Epoch’s findings. Epoch stated that the discrepancy could stem from differences in evaluation methodology, the exact version of FrontierMath used, or the computational configuration during testing. They noted,
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).”
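For readers wondering how headline figures like “10% (±2%)” are produced, a benchmark score is typically just the fraction of problems solved, with the parenthetical margin reflecting statistical uncertainty that shrinks as the problem set grows, which is part of why a 180-problem subset and a 290-problem subset can yield different numbers. The sketch below is purely illustrative and assumes the margin is one binomial standard error; the function name, the pass/fail counts, and the scoring details are hypothetical, not Epoch AI’s actual evaluation harness.

```python
import math

def benchmark_score(results: list[bool]) -> tuple[float, float]:
    """Return (accuracy, margin) for a list of per-problem pass/fail results.

    Illustrative only: accuracy is the fraction of problems solved, and the
    margin is assumed to be one binomial standard error under a normal
    approximation. The real harness may define its margin differently.
    """
    n = len(results)
    accuracy = sum(results) / n  # True counts as 1, False as 0
    margin = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, margin

# Hypothetical example: 29 of 290 problems solved on the larger private set
acc, margin = benchmark_score([True] * 29 + [False] * 261)
print(f"{acc:.0%} (±{margin:.0%})")  # -> 10% (±2%)
```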
Industry Implications and Calls for Transparency
The gap between OpenAI’s internally reported performance and the third-party analysis has reignited wider concerns about transparency in AI research. As companies compete to show off the capabilities of their large language models, the lack of standardized evaluation criteria and independent oversight poses serious challenges to industry credibility and public confidence.
The OpenAI o3 case illustrates a growing concern in the AI industry: inconsistent benchmarking. As competition intensifies, organizations frequently favor hype over rigor. Epoch AI itself was criticized for delaying disclosure of its OpenAI funding, xAI was accused of misrepresenting Grok 3’s benchmark results, and Meta touted scores for a different model version than the one it released. These incidents highlight the need for more transparent, uniform evaluation processes.