OpenAI’s o3 Model Benchmarking Dataset Revealed

Revelations that OpenAI Secretly Funded and Had Access to FrontierMath Benchmarking Dataset Raise Concerns

OpenAI’s Secret Involvement Raises Questions about the Validity of o3 Model’s High Scores

Revelations that OpenAI secretly funded and had access to the FrontierMath benchmarking dataset are raising concerns about whether the dataset was used to train its o3 reasoning model, and about the validity of the model’s high scores.

Access to the Benchmarking Dataset

In addition to accessing the benchmarking dataset, OpenAI funded its creation, a fact that was withheld from the mathematicians who contributed to developing FrontierMath. Epoch AI disclosed OpenAI’s funding only in the final version of the paper announcing the benchmark, published on arXiv.org. Earlier versions of the paper omitted any mention of OpenAI’s involvement.

Screenshot of FrontierMath Paper

[Image: Screenshot of FrontierMath paper]

Closeup of Acknowledgement

[Image: Closeup of acknowledgement]

Previous Version of Paper that Lacked Acknowledgement

[Image: Previous version of paper that lacked acknowledgement]

OpenAI o3 Model Scored Highly on FrontierMath Benchmark

The news of OpenAI’s secret involvement is raising questions about the high scores achieved by the o3 reasoning AI model and causing disappointment with the FrontierMath project. Epoch AI responded with transparency about what happened and what they’re doing to check if the o3 model was trained with the FrontierMath dataset.

Giving OpenAI Access to the Dataset was Unintended

Giving OpenAI access to the dataset was unexpected because the whole point of a benchmark is to test AI models, and that cannot be done reliably if the models have seen the questions and answers beforehand.

Reddit Discussion

A post on the r/singularity subreddit expressed disappointment and cited a document that claimed that the mathematicians didn’t know about OpenAI’s involvement:

"Frontier Math, the recent cutting-edge math benchmark, is funded by OpenAI. OpenAI allegedly has access to the problems and solutions. This is disappointing because the benchmark was sold to the public as a means to evaluate frontier models, with support from renowned mathematicians. In reality, Epoch AI is building datasets for OpenAI. They never disclosed any ties with OpenAI before."

Epoch AI’s Response

Tamay Besiroglu, associate director at Epoch AI, acknowledged that OpenAI had access to the dataset but also asserted that there was a "holdout" dataset that OpenAI didn’t have access to.
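To make the role of a holdout set concrete, here is a minimal Python sketch of how a score on problems a developer could see might be compared with a score on problems it could not. This is an illustration only, not Epoch AI's actual evaluation procedure; the function names `accuracy` and `holdout_gap` and all the sample answers are hypothetical and invented for this example.

```python
def accuracy(predictions, answers):
    """Return the fraction of problems answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one problem is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def holdout_gap(shared_preds, shared_answers, holdout_preds, holdout_answers):
    """Accuracy on the shared set minus accuracy on the holdout set.

    A large positive gap is one possible warning sign that the model
    benefited from seeing the shared problems; it is not proof by itself.
    """
    return accuracy(shared_preds, shared_answers) - accuracy(
        holdout_preds, holdout_answers
    )


# Invented toy data: final answers the model gave vs. the reference answers.
shared_gap = holdout_gap(
    shared_preds=[12, 7, 30, 5],
    shared_answers=[12, 7, 30, 9],   # 3 of 4 correct on problems that were shared
    holdout_preds=[4, 8, 15, 16],
    holdout_answers=[4, 0, 0, 0],    # 1 of 4 correct on unseen problems
)
print(f"Accuracy gap (shared minus holdout): {shared_gap:.2f}")
```

If the two sets are of similar difficulty, a model that was not trained on the shared problems should score about the same on both, so the gap should be close to zero. With sets this small the comparison would be too noisy to draw conclusions; the sketch only shows the idea.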

More Facts About OpenAI & FrontierMath Revealed

Elliot Glazer, lead mathematician at Epoch AI, confirmed that OpenAI has the dataset and was allowed to use it to evaluate o3, its next state-of-the-art large language model, which is described as a reasoning AI model.

Conclusion

The question remains open until Epoch AI’s evaluation is completed, which should indicate whether OpenAI trained its reasoning model on the dataset or only used it for benchmarking.

FAQs

Q: What is the FrontierMath benchmarking dataset?
A: FrontierMath is a mathematics benchmarking dataset, created by Epoch AI with contributions from mathematicians, for evaluating AI models.

Q: Who funded the creation of the FrontierMath dataset?
A: OpenAI funded the creation of the FrontierMath dataset.

Q: Did OpenAI have access to the FrontierMath dataset?
A: Yes, OpenAI had access to the FrontierMath dataset.

Q: Was OpenAI’s involvement disclosed to the mathematicians who contributed to the development of FrontierMath?
A: No, OpenAI’s involvement was not disclosed to the mathematicians who contributed to the development of FrontierMath.

Q: What is the holdout dataset?
A: The holdout dataset is a separate set of problems that OpenAI cannot access. It is used to verify the performance of AI models independently.
