Meta’s Plans to Develop Open-Source AI Models Revealed in Court Documents
Internal Communications Suggest Use of Pirated Data and Efforts to Conceal It
A major copyright lawsuit against Meta has uncovered a trove of internal communications revealing the company’s plans to develop its open-source AI models, including Llama. The messages suggest that Meta used copyrighted data to train its AI systems and worked to conceal it.
"The Goal Needs to be GPT4"
In an October 2023 email, Meta’s vice president of generative AI, Ahmad Al-Dahle, wrote to researcher Hugo Touvron that the company’s goal "needs to be GPT4," referring to the large language model OpenAI announced in March 2023. Al-Dahle added that Meta needed to "learn how to build frontier and win this race."
Using Pirated Data
An undated email from Meta director of product Sony Theakanath to VP of AI research Joelle Pineau discussed using Library Genesis (LibGen), a book piracy site, to train Meta’s AI systems. Theakanath wrote that "LibGen is essential to meet SOTA numbers" and said it was known that OpenAI and Mistral were using the library for their models.
Concealing the Use of Pirated Data
To avoid "media coverage suggesting we have used a dataset we know to be pirated," Theakanath and other Meta employees discussed ways to obscure copyright information in the LibGen data used for training. They suggested removing "more copyright headers and document identifiers" and considered whether to remove a paper’s list of authors "to reduce liability."
Internal Documents
Other internal documents show how Meta employees approached obscuring copyright information in the LibGen data. A document titled "observations on LibGen-SciMag" contains comments left by employees on how to improve the dataset, including a suggestion to "remove more copyright headers and document identifiers."
Conclusion
The court documents suggest that Meta trained its AI systems on copyrighted data, including material from a known piracy site, and worked to conceal that fact. The internal messages point to a motive: competing with rivals like OpenAI and Mistral in the race to develop large language models.
FAQs
Q: What is Meta’s Llama AI model?
A: Llama is Meta’s family of open-source large language models.
Q: What is Library Genesis (LibGen)?
A: LibGen is a book piracy site that provides access to copyrighted materials.
Q: Did OpenAI and Mistral use LibGen?
A: The companies have not publicly stated whether they use LibGen, but an internal Meta email claims it was known that they do.
Q: What is the purpose of using pirated data in AI training?
A: Pirated collections such as LibGen offer a large volume of text, which can help a model reach better benchmark results. However, using copyrighted material without permission can expose a company to copyright-infringement claims, as the lawsuit against Meta shows.
Q: What is the current state of AI development?
A: AI development is an active area of research, with companies like Meta, OpenAI, and Google working to develop large language models.

