Harvard Is Releasing a Massive Free AI Training Dataset

Harvard’s Institutional Data Initiative Announces New Dataset

Harvard University announced Thursday that it is releasing a high-quality dataset of nearly one million public-domain books that can be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI.

A Game-Changer for AI Research and Development

The dataset, which is around five times the size of the notorious Books3 dataset, spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. According to Greg Leppert, executive director of the Institutional Data Initiative, the project aims to "level the playing field" by giving the general public, including small players in the AI industry and individual researchers, access to highly refined and curated content repositories that normally only established tech giants have the resources to assemble.

Using the Dataset for AI Development

Leppert believes that the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models. "I think about it a bit like the way that Linux has become a foundational operating system for so much of the world," he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.

Microsoft’s Support for the Project

Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was in line with its broader beliefs about the value of creating "pools of accessible data" that AI startups can use and that are "managed in the public’s interest." Davis noted that Microsoft uses publicly available data to train its own models and does not plan to replace all of its AI training data with public domain alternatives like the books in the new Harvard database.

The Future of AI Development

As dozens of lawsuits filed over the use of copyrighted data for training AI wind their way through the courts, the future of how artificial intelligence tools are built hangs in the balance. If AI companies win their cases, they will be able to keep scraping the internet without needing to enter into licensing agreements with copyright holders. But if they lose, AI companies could be forced to overhaul how their models are made. A wave of projects like the Harvard database is plowing forward under the assumption that there will be an appetite for public domain datasets.

Additional Initiatives

In addition to the trove of books, the Institutional Data Initiative is working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it is open to forming similar collaborations down the line. How the books dataset will be released is not yet settled. Harvard has asked Google to work with it on public distribution, but the search giant has not publicly agreed to host the dataset.

Conclusion

The release of this massive public domain book dataset is a significant step forward in making high-quality training data more accessible to the AI community. As the future of AI development hangs in the balance, projects like this one are crucial in providing a foundation for innovation and progress in the field.

FAQs

Q: What is the size of the dataset?
A: The dataset is around five times the size of the notorious Books3 dataset.

Q: What kind of content is included in the dataset?
A: The dataset includes books from various genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries.

Q: How will the dataset be released?
A: How the books dataset will be released is not yet settled. Harvard has asked Google to work with it on public distribution, but the search giant has not publicly agreed to host the dataset.

Q: What is the significance of this project?
A: The project aims to "level the playing field" by giving the general public, including small players in the AI industry and individual researchers, access to highly refined and curated content repositories that normally only established tech giants have the resources to assemble.