OpenAI Reviews Evidence of Data Harvesting by Chinese Start-up DeepSeek
OpenAI’s Terms of Service Allegedly Breached
OpenAI, a San Francisco-based start-up valued at $157 billion, is reviewing evidence that DeepSeek, a Chinese company, may have harvested large amounts of data from its AI technologies to teach similar skills to its own systems. This process, known as distillation, is common in the AI field, but OpenAI’s terms of service prohibit anyone from using data generated by its systems to build technologies that compete in the same market.
"We know that groups in the P.R.C. are actively working to use methods, including what’s known as distillation, to replicate advanced U.S. AI models," said OpenAI spokeswoman Liz Bourgeois. "We are aware of and reviewing indications that DeepSeek may have inappropriately distilled our models, and will share information as we know more."
DeepSeek’s AI Technologies
DeepSeek has sent shockwaves through Silicon Valley by releasing AI technologies that match the performance of anything else on the market. The prevailing wisdom had been that the most powerful systems could not be built without billions of dollars in specialized computer chips, but DeepSeek claims to have created its technologies using far fewer resources.
AI Companies and Open Sourcing
AI companies like DeepSeek and OpenAI rely heavily on a practice called open sourcing, freely sharing code that underpins their technologies and reusing code shared by others. They see this as a way of accelerating technological development. AI companies also need massive amounts of online data to train their systems, which learn their skills by pinpointing patterns in text, computer programs, images, sounds, and videos.
Distillation and Data Harvesting
Distillation, in which a new system is trained on data generated by an existing one, is often used to train new systems. Open source technologies allow it, but it may be legally problematic if a company takes data from proprietary technology. OpenAI’s terms of service explicitly prohibit anyone from using data generated by its systems to build technologies that compete in the same market.
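For readers who want a concrete picture of distillation, the following is a toy sketch only, not a description of how DeepSeek, OpenAI, or any other company trains its systems. It assumes a hypothetical `teacher` function standing in for an existing model and fits a small "student" model purely to the teacher's outputs, with no human-labeled data involved. All names and numbers here are illustrative assumptions.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def teacher(x: float) -> float:
    """Hypothetical stand-in for an existing model: returns a probability for input x."""
    return sigmoid(3.0 * x - 1.0)


def distill(n_samples: int = 200, epochs: int = 2000, lr: float = 0.5, seed: int = 0):
    """Fit a student (w, b) to teacher-generated outputs using gradient descent."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n_samples)]
    # The training targets are produced by the teacher, not by human labelers.
    soft_labels = [teacher(x) for x in xs]

    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, t in zip(xs, soft_labels):
            p = sigmoid(w * x + b)
            # Gradient of cross-entropy between student output p and teacher output t.
            grad_w += (p - t) * x
            grad_b += p - t
        w -= lr * grad_w / n_samples
        b -= lr * grad_b / n_samples
    return w, b


def student(x: float, w: float, b: float) -> float:
    return sigmoid(w * x + b)


w, b = distill()
```

In this toy setup the student has the same form as the teacher, so its parameters should land close to the teacher's (3.0 and -1.0). Real systems are far larger, and a student typically cannot reproduce its teacher exactly.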
Lawsuits Against OpenAI
OpenAI is facing more than a dozen lawsuits accusing it of illegally using copyrighted internet data to train its systems. One of these lawsuits was brought by The New York Times against OpenAI and its partner Microsoft, alleging that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information. Both OpenAI and Microsoft deny the claims.
Conclusion
The situation highlights the complex issues surrounding AI development and data harvesting. While OpenAI’s terms of service prohibit using data generated by its systems to build competing technologies, the company’s own data practices have been called into question in court. OpenAI’s review of DeepSeek is ongoing, and its outcome could have significant implications for the AI industry.
FAQs
Q: What is distillation in AI?
A: Distillation is the process of training a new AI system on data generated by an existing system.
Q: What is open sourcing in AI?
A: Open sourcing is the practice of freely sharing code that underpins AI technologies and reusing code shared by others to accelerate technological development.
Q: What are the implications of the alleged breach of OpenAI’s terms of service?
A: If confirmed, the breach could have significant implications for the AI industry, including the potential for legal action and reputational damage.
Q: What is the current status of The New York Times’s lawsuit against OpenAI and Microsoft?
A: The lawsuit is ongoing, and both OpenAI and Microsoft deny the claims.