
A New Study Lends Credence to Allegations of OpenAI’s Use of Copyrighted Content

OpenAI Embroiled in Suits Over Alleged Use of Copyrighted Works

OpenAI is facing lawsuits from authors, programmers, and other rights-holders who claim that the company used their works without permission to develop its AI models. OpenAI maintains that its use of copyrighted content falls under fair use, but the plaintiffs counter that U.S. copyright law contains no carve-out for training data.

Study Proposes New Method for Identifying Memorized Training Data

Researchers at the University of Washington, the University of Copenhagen, and Stanford have proposed a new method for identifying training data “memorized” by models behind an API, like OpenAI’s. The study uses the concept of “high-surprisal” words, which are words that stand out as uncommon in the context of a larger body of work.

Methodology

The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to "guess" which words had been masked. If a model guessed correctly, the co-authors concluded, it had likely memorized that snippet during training.
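The masking step can be sketched roughly as follows. This is a simplified illustration, not the study's actual method: the `mask_highest_surprisal` helper and the toy unigram frequency model are assumptions made for the example, whereas the paper's surprisal scoring is more sophisticated.

```python
import math
from collections import Counter

def surprisal(word, freq, total):
    # Surprisal = -log2 P(word); words unseen in the reference corpus
    # get a small pseudo-count (0.5) so rare words score highest.
    return -math.log2(freq.get(word, 0.5) / total)

def mask_highest_surprisal(snippet, reference_corpus, k=1):
    """Mask the k highest-surprisal words in a snippet to build a
    fill-in-the-blank probe. A toy unigram model stands in for the
    study's scoring model, which this sketch does not reproduce."""
    freq = Counter(reference_corpus.lower().split())
    total = sum(freq.values())
    words = snippet.split()
    scores = {w: surprisal(w, freq, total) for w in {x.lower() for x in words}}
    # Pick the k words that are rarest relative to the reference corpus.
    targets = sorted(scores, key=scores.get, reverse=True)[:k]
    masked = " ".join("[MASK]" if w.lower() in targets else w for w in words)
    return masked, targets
```

The masked snippet would then be sent to the model under test; a correct fill-in for an uncommon word is the signal the researchers treat as evidence of memorization.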

Results

According to the results of the tests, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset containing samples of copyrighted ebooks called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.

Conclusion

The study’s findings shed light on the “contentious data” models might have been trained on. The researchers describe their work as a tool for probing large language models, but argue that there is a real need for greater data transparency across the whole ecosystem.

FAQs

Q: What is the significance of the study’s findings?
A: The study’s findings suggest that OpenAI’s models may have been trained on copyrighted content without permission, which could have legal implications.

Q: What is the method used in the study to identify memorized training data?
A: The study uses the concept of “high-surprisal” words, which are words that stand out as uncommon in the context of a larger body of work.

Q: What are the implications of the study’s findings for OpenAI?
A: The study’s findings could have legal implications for OpenAI, as they suggest that the company may have used copyrighted content without permission.

Q: What does the study suggest about the need for data transparency in the AI ecosystem?
A: The study suggests that there is a real need for greater data transparency in the whole ecosystem, as it is difficult to audit and examine large language models scientifically without it.
