Meta Employees Discussed Using Copyrighted Works to Train AI Models

For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company’s AI models, according to court documents unsealed on Thursday.

Internal Discussions

The documents were submitted by plaintiffs in the case Kadrey v. Meta, one of many AI copyright disputes slowly winding through the U.S. court system. The defendant, Meta, claims that training models on IP-protected works, particularly books, is "fair use." The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.

Internal Chats

Previous materials submitted in the suit alleged that Meta CEO Mark Zuckerberg gave Meta’s AI team the OK to train on copyrighted content and that Meta halted AI training data licensing talks with book publishers. But the new filings, most of which show portions of internal work chats between Meta staffers, paint the clearest picture yet of how Meta may have come to use copyrighted data to train its models, including models in the company’s Llama family.

Using Pirated Books for Training

In one chat, Meta employees, including Melanie Kambadur, a senior manager for Meta’s Llama model research team, discussed training models on works they knew might be legally fraught.

Libgen Talks

In another work chat, Kambadur discussed possibly using Libgen, a "links aggregator" that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license.

Competitive Pressure

Some decision-makers within Meta appear to have been under the impression that failing to use Libgen for model training could seriously hurt Meta’s competitiveness in the AI race, according to the filings.

Mitigations

In an email, Theakanath outlined "mitigations" intended to help reduce Meta’s legal exposure, including removing data from Libgen "clearly marked as pirated/stolen" and not publicly citing its use. "We would not disclose use of Libgen datasets used to train," as Theakanath put it.

Other Revelations

The filings contain other revelations, including the implication that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit said in April 2023 that it planned to begin charging AI companies to access data for model training.

FAQs

Q: What is the case Kadrey v. Meta about?
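A: Kadrey v. Meta is a copyright lawsuit brought by plaintiffs including authors Sarah Silverman and Ta-Nehisi Coates. They dispute Meta’s claim that training AI models on IP-protected works, particularly books, is "fair use."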

Q: What is Libgen?
A: Libgen is a "links aggregator" that provides access to copyrighted works from publishers.

Q: Is Libgen legal?
A: No, Libgen has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.

Q: What is the implication of Meta’s actions?
A: According to the filings, Meta may have used copyrighted works obtained through legally questionable means to train its AI models, including models in its Llama family, potentially violating copyright law.
