Microsoft Launches Research Project to Estimate Influence of Training Data on AI Models
Introduction
Microsoft is launching a research project to estimate the influence of specific training examples on the text, images, and other types of media that generative AI models create. The project, which is seeking a research intern, aims to demonstrate that models can be trained in such a way that the impact of particular data on their outputs can be "efficiently and usefully estimated".
Background
AI-powered text, code, image, video, and song generators are at the center of a number of IP lawsuits against AI companies. These companies often train their models on massive amounts of data from public websites, some of which is copyrighted. Many of the companies argue that fair use doctrine shields their data-scraping and training practices. However, creatives – from artists to programmers to authors – largely disagree.
Lawsuits Against Microsoft
The New York Times sued Microsoft and its partner, OpenAI, in December 2023, accusing the two companies of infringing on The Times’ copyright by deploying models trained on millions of its articles. Several software developers have also filed suit against Microsoft, claiming that the firm’s GitHub Copilot AI coding assistant was unlawfully trained on their protected works.
Collaboration with Jaron Lanier
The research project involves Jaron Lanier, the accomplished technologist and interdisciplinary scientist at Microsoft Research. Lanier has written about the concept of "data dignity", which involves connecting "digital stuff" with "the humans who want to be known for having made it".
Data Dignity
A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output. For instance, if you ask a model for "an animated movie of my kids in an oil-painting world of talking cats on an adventure", then certain key oil painters, cat portraitists, voice actors, and writers – or their estates – might be calculated to have been uniquely essential to the creation of the new masterpiece. They would be acknowledged and motivated. They might even get paid.
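Microsoft has not said how its project would score influence, but one family of techniques from the research literature estimates a training example's contribution to a given output by comparing gradients: if the gradient of the loss on a training example points in the same direction as the gradient on a test output, that example likely pushed the model toward producing it. The sketch below illustrates the idea on a toy logistic-regression model; the function names and data are hypothetical, and this is a minimal illustration of gradient-similarity influence scoring, not Microsoft's method.

```python
import numpy as np

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for a single example (x, y)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

def influence(w, x_train, y_train, x_test, y_test, lr=0.1):
    """Influence score of a training example on a test example:
    the dot product of their loss gradients at one model checkpoint,
    scaled by the learning rate. A positive score suggests the
    training example nudged the model toward the test output."""
    g_train = grad_logistic(w, x_train, y_train)
    g_test = grad_logistic(w, x_test, y_test)
    return lr * (g_train @ g_test)

# Toy data: one checkpoint (weight vector) and two training examples.
w = np.array([0.5, -0.2])
x_a, y_a = np.array([1.0, 0.0]), 1   # similar to the test point
x_b, y_b = np.array([0.0, 1.0]), 0   # unrelated feature direction
x_t, y_t = np.array([1.0, 0.1]), 1   # the "output" being attributed

print(influence(w, x_a, y_a, x_t, y_t))  # larger: gradients align
print(influence(w, x_b, y_b, x_t, y_t))  # smaller: gradients diverge
```

In a full system the scores would be summed over many training checkpoints and ranked, so the highest-scoring contributors could be credited or compensated, which is roughly the accounting that a data-dignity scheme requires.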
Existing Solutions
Several companies are already attempting to address this issue. AI model developer Bria claims to "programmatically" compensate data owners according to their "overall influence". Adobe and Shutterstock also award regular payouts to dataset contributors, although the exact payout amounts tend to be opaque.
Challenges and Uncertainty
Few large labs have established individual contributor payout programs outside of inking licensing agreements with publishers, platforms, and data brokers. They’ve instead provided means for copyright holders to "opt out" of training. However, some of these opt-out processes are onerous, and they apply only to future models – not previously trained ones.
Conclusion
Microsoft’s project may amount to little more than a proof of concept. However, the company’s investigation into ways to trace training data is notable in light of other AI labs’ recently expressed stances on fair use. Several top labs, including Google and OpenAI, have published policy documents recommending that the Trump administration weaken copyright protections as they relate to AI development.
FAQs
Q: What is the purpose of Microsoft’s research project?
A: The project aims to estimate the influence of specific training examples on the text, images, and other types of media that generative AI models create.
Q: What is the scope of the project?
A: The project is currently seeking a research intern and aims to demonstrate that models can be trained in such a way that the impact of particular data on their outputs can be "efficiently and usefully estimated".
Q: Will Microsoft’s project lead to changes in the way AI models are trained?
A: It is possible that the project will lead to changes in the way AI models are trained, particularly if it is successful in tracing the influence of specific training data.
Q: Will other AI labs follow Microsoft’s lead?
A: It is possible that other AI labs will follow Microsoft’s lead and launch similar research projects to estimate the influence of training data on their models.

