
A New Study Lends Credence to Allegations of OpenAI’s Use of Copyrighted Content

OpenAI Embroiled in Suits Over Alleged Use of Copyrighted Works

OpenAI is facing lawsuits from authors, programmers, and other rights-holders who claim that the company used their works without permission to develop its AI models. OpenAI maintains that its use of copyrighted content falls under fair use, but the plaintiffs counter that U.S. copyright law contains no carve-out for training data.

Study Proposes New Method for Identifying Memorized Training Data

Researchers at the University of Washington, the University of Copenhagen, and Stanford have proposed a new method for identifying training data “memorized” by models behind an API, like OpenAI’s. The study uses the concept of “high-surprisal” words, which are words that stand out as uncommon in the context of a larger body of work.

Methodology

The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to "guess" which words had been masked. If a model guessed correctly, the co-authors concluded, it had likely memorized that snippet during training.
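The masking step can be sketched roughly as follows. This is a simplified illustration, not the study's actual method: the `mask_highest_surprisal` helper and the toy unigram frequency model are assumptions made for the example, whereas the paper's surprisal scoring is more sophisticated.

```python
import math
from collections import Counter

def surprisal(word, freq, total):
    # Surprisal = -log2 P(word); words unseen in the reference corpus
    # get a small pseudo-count (0.5) so rare words score highest.
    return -math.log2(freq.get(word, 0.5) / total)

def mask_highest_surprisal(snippet, reference_corpus, k=1):
    """Mask the k highest-surprisal words in a snippet to build a
    fill-in-the-blank probe. A toy unigram model stands in for the
    study's scoring model, which this sketch does not reproduce."""
    freq = Counter(reference_corpus.lower().split())
    total = sum(freq.values())
    words = snippet.split()
    scores = {w: surprisal(w, freq, total) for w in {x.lower() for x in words}}
    # Pick the k words that are rarest relative to the reference corpus.
    targets = sorted(scores, key=scores.get, reverse=True)[:k]
    masked = " ".join("[MASK]" if w.lower() in targets else w for w in words)
    return masked, targets
```

The masked snippet would then be sent to the model under test; a correct fill-in for an uncommon word is the signal the researchers treat as evidence of memorization.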

Results

According to the results of the tests, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset containing samples of copyrighted ebooks called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.

Conclusion

The study’s findings shed light on the “contentious data” models might have been trained on. The researchers describe their work as a tool for probing large language models, but argue that there is a real need for greater data transparency across the whole ecosystem.

FAQs

Q: What is the significance of the study’s findings?
A: The study’s findings suggest that OpenAI’s models may have been trained on copyrighted content without permission, which could have legal implications.

Q: What is the method used in the study to identify memorized training data?
A: The study uses the concept of “high-surprisal” words, which are words that stand out as uncommon in the context of a larger body of work.

Q: What are the implications of the study’s findings for OpenAI?
A: The study’s findings could have legal implications for OpenAI, as they suggest that the company may have used copyrighted content without permission.

Q: What does the study suggest about the need for data transparency in the AI ecosystem?
A: The study suggests that there is a real need for greater data transparency in the whole ecosystem, as it is difficult to audit and examine large language models scientifically without it.
