Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

AI Models and Copyrighted Content: A New Paper Accuses OpenAI of Training on Non-Public Books

Training AI Models on Copyrighted Content

AI models are complex prediction engines: trained on vast amounts of data, they learn patterns in that data and novel ways to extrapolate from a simple prompt. While a number of AI labs have begun embracing synthetic, AI-generated data for training, few have eschewed real-world data entirely, since training on purely synthetic data carries risks, such as degrading a model’s performance.

The New Paper

The AI Disclosures Project, a nonprofit co-founded by Tim O’Reilly and Ilan Strauss, has released a new paper accusing OpenAI of training its GPT-4o model on paywalled books from O’Reilly Media. The paper concludes that OpenAI likely trained GPT-4o on these non-public books without a licensing agreement.

Methodology

The paper used a method called DE-COP, designed to detect copyrighted content in language models’ training data. The method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, this suggests the model may have prior knowledge of the text from its training data.
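The core of this test can be sketched as a multiple-choice quiz: present the model with the original passage mixed among paraphrases, and check whether it picks the original more often than chance. The sketch below is a minimal illustration of that scoring idea, not the paper’s actual implementation; `ask_model` is a hypothetical stand-in for a real model call.

```python
import random

def decop_quiz_accuracy(items, ask_model, n_options=4):
    """Score a DE-COP-style quiz.

    For each item, shuffle the original passage among its paraphrases
    and ask the model to pick the original. Returns the fraction of
    items the model identified correctly.
    """
    correct = 0
    for original, paraphrases in items:
        options = [original] + list(paraphrases[: n_options - 1])
        random.shuffle(options)
        guess = ask_model(options)  # index of the model's pick
        if options[guess] == original:
            correct += 1
    return correct / len(items)

# Hypothetical stand-in for a real model call: always picks option 0.
def always_first(options):
    return 0

items = [
    ("original passage A", ["paraphrase A1", "paraphrase A2", "paraphrase A3"]),
    ("original passage B", ["paraphrase B1", "paraphrase B2", "paraphrase B3"]),
]
rate = decop_quiz_accuracy(items, always_first)
```

A model guessing blindly should land near 1/n_options (25% with four choices); accuracy consistently above that baseline is the signal DE-COP treats as evidence the text appeared in the model’s training data.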

Results

The paper reports that GPT-4o "recognized" far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. This held even after accounting for potential confounding factors, such as newer models’ improved ability to judge whether a text is human-authored.

Conclusion

While the co-authors are careful to note that their experimental method isn’t foolproof, the results suggest that OpenAI may have trained its GPT-4o model on non-public O’Reilly books without a licensing agreement. That is a serious accusation, and OpenAI’s silence on the matter only adds to the concerns.

FAQs

Q: What is the AI Disclosures Project?
A: The AI Disclosures Project is a nonprofit organization co-founded by Tim O’Reilly and Ilan Strauss to promote transparency and accountability in the development of AI models.

Q: What is DE-COP?
A: DE-COP is a method designed to detect copyrighted content in language models’ training data. It tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text.

Q: What are the implications of this paper?
A: The implications are serious, as they suggest that OpenAI may have trained its GPT-4o model on non-public O’Reilly books without a licensing agreement. This could have significant legal and ethical implications for the company and the broader AI industry.

Q: How did the paper arrive at its conclusions?
A: The paper applied DE-COP to excerpts of paywalled O’Reilly books, testing whether OpenAI’s GPT-4o model could distinguish the original passages from AI-paraphrased versions more reliably than older models could, which would indicate prior knowledge of the content.
