We Need to Learn How to Build and Win This Race

Meta’s Plans to Develop Open-Source AI Models Revealed in Court Documents

Internal Communications Suggest Use of Pirated Data and Efforts to Conceal It

A major copyright lawsuit against Meta has uncovered a trove of internal communications revealing the company’s plans to develop its open-source AI models, including Llama. The messages suggest that Meta used copyrighted data to train its AI systems and worked to conceal it.

"The Goal Needs to be GPT4"

In an October 2023 email, Meta’s vice president of generative AI, Ahmad Al-Dahle, wrote to researcher Hugo Touvron that the company’s goal "needs to be GPT4," referring to the large language model OpenAI announced in March 2023. Meta needed to "learn how to build frontier and win this race."

Using Pirated Data

An undated email from Meta director of product Sony Theakanath to VP of AI research Joelle Pineau discussed using Library Genesis (LibGen), a book piracy site, to train Meta’s AI systems. Theakanath believed that "LibGen is essential to meet SOTA [state-of-the-art] numbers" and said it was known that OpenAI and Mistral were using the library for their models.

Concealing the Use of Pirated Data

To avoid "media coverage suggesting we have used a dataset we know to be pirated," Theakanath and other Meta employees discussed measures to obscure copyright information in the LibGen training data. They suggested removing "more copyright headers and document identifiers" and considered whether to remove a paper’s list of authors "to reduce liability."

Internal Documents

Other internal documents detail the measures Meta took to obscure copyright information in the LibGen training data. A document titled "observations on LibGen-SciMag" contains comments from employees on how to improve the dataset, including a suggestion to "remove more copyright headers and document identifiers."

Conclusion

The revelations in these court documents suggest that Meta used copyrighted data to train its AI systems and worked to conceal it. The company’s actions may have been motivated by a desire to stay ahead of rivals like OpenAI and Mistral in the race to develop large language models.

FAQs

Q: What is Meta’s Llama AI model?
A: Meta’s Llama is an open-source AI model developed by the company.

Q: What is Library Genesis (LibGen)?
A: LibGen is a book piracy site that provides access to copyrighted materials.

Q: Did OpenAI and Mistral use LibGen?
A: The companies have not publicly stated whether they use LibGen, but an internal Meta email suggests that they do.

Q: What is the purpose of using pirated data in AI training?
A: Pirated collections offer a large volume of training data, which can help a model reach better results. However, using copyrighted material without permission can expose a company to legal consequences, as the lawsuit against Meta shows.

Q: What is the current state of AI development?
A: AI development is an active area of research, with companies like Meta, OpenAI, and Google working to develop large language models.
