JAKARTA - OpenAI unveiled its o3 artificial intelligence (AI) model in December 2024. Several months later, the model has come under scrutiny because of benchmark results from third parties.
OpenAI's own testing showed high scores, and o3 was claimed to be better than Grok 3. To demonstrate its capabilities, OpenAI said that o3 could answer more than a quarter of the questions in FrontierMath, a mathematics benchmark.
That result is far better than those of its competitors, which managed to solve only about 2 percent of the FrontierMath problems. During OpenAI's livestream announcing o3, the company also stated that the model scored more than 25 percent on the benchmark.
However, testing by Epoch AI told a different story. The research institute behind FrontierMath shared the results of its o3 evaluation on April 18, 2025. The model scored only about 10 percent, roughly 15 percentage points below the figure OpenAI had claimed.
This does not necessarily mean OpenAI lied about its benchmark results, because the score Epoch shared matches the lower-bound score that OpenAI itself reported. Epoch also explained that the gap may have been caused by the use of a different version of FrontierMath.
"The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath," Epoch wrote in its o3 testing report.
Meanwhile, an organization that took part in testing a pre-release version of o3 corroborated Epoch's findings. The ARC Prize Foundation, as cited by TechCrunch, said that the current public o3 model is indeed a different one.
This suggests that Epoch's test results were not mistaken. Rather, ARC Prize pointed out that OpenAI's earlier o3 results came from a pre-release version, and that the model eventually released is different.
"[The public o3] is a different model tuned for chat/product use," ARC Prize said on its official X account. "All released o3 compute tiers are smaller than the version we [previously tested]."