OpenAI’s Bot Crushes Website Like a DDoS Attack

On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It looked to be some kind of distributed denial-of-service attack.

The Culprit: OpenAI Bot

He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.

The Scale of the Attack

"We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos." OpenAI was sending "tens of thousands" of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions.

The Impact on the Business

"OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it’s way more," he said of the IP addresses the bot used to attempt to consume his site. "Their crawlers were crushing our site," he said. "It was basically a DDoS attack."

The Business: Triplegangers

Triplegangers’ website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of "human digital doubles" on the web, meaning 3D image files scanned from actual human models. It sells the 3D object files, as well as photos — everything from hands to hair, skin, and full bodies — to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics.

The Problem with Robots.txt

Tomchuk's team, based in Ukraine but also licensed in the U.S. out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI's bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, that have their own tags, according to its information page on its crawlers.)
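
For reference, a robots.txt configured along those lines might look like the sketch below. The three user-agent tokens are the ones OpenAI documents for its crawlers; blocking all of them from the entire site is an illustrative choice, not something taken from the article.

```
# Tell each of OpenAI's documented crawlers to stay off the whole site.
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```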

The Solution: Proper Configuration

Robots.txt, the file at the heart of the Robots Exclusion Protocol, was created to tell search engine crawlers what not to crawl as they index the web. OpenAI says on its informational page that it honors such files when configured with its own set of do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.
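
To illustrate how a well-behaved crawler is supposed to consult that file, here is a minimal sketch using Python's standard urllib.robotparser. The site URL and paths are placeholders, not anything from the article; only the "GPTBot" token comes from OpenAI's own documentation.

```python
from urllib.robotparser import RobotFileParser

# A compliant crawler fetches robots.txt and checks it before requesting pages.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder site
parser.read()  # download and parse the file

# "GPTBot" is the user-agent token OpenAI documents for its main crawler.
if parser.can_fetch("GPTBot", "https://example.com/products/123"):
    print("robots.txt allows GPTBot to crawl this path")
else:
    print("robots.txt disallows this path for GPTBot")
```

The catch, as the article notes, is that nothing in the protocol forces a crawler to run a check like this.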

The Aftermath

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during U.S. business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot.

Robots.txt Isn't a Failsafe

Robots.txt also isn't a failsafe; AI companies comply with it only voluntarily. Another AI startup, Perplexity, rather famously got called out last summer by a Wired investigation when some evidence implied it wasn't honoring the file.

The Conclusion

The problem is that the onus is on the business owner to understand how to block the bots, and many small online businesses may not have the resources or expertise to do so. The solution is for OpenAI and other AI companies to ask for permission instead of scraping data, and for small businesses to be aware of the risks and take steps to protect themselves.
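
Because compliance with robots.txt is voluntary, site owners who want hard enforcement have to refuse these crawlers at the server itself. Below is a minimal, hypothetical sketch of that idea as a WSGI middleware in Python; the user-agent substrings come from OpenAI's documented crawler names, but the rest (the wrapper, the 403 response) is illustrative, not a prescription for any particular stack.

```python
# Hypothetical sketch: deny requests from known AI-crawler user agents
# at the application layer, where compliance is not optional.
BLOCKED_AGENTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot")  # OpenAI's documented crawlers

def block_ai_crawlers(app):
    """Wrap a WSGI app so requests from blocked user agents get a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in user_agent for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware

# Usage: app = block_ai_crawlers(app)
```

One caveat worth noting: user-agent strings can be spoofed, so filtering by published crawler IP ranges is the sturdier (if higher-maintenance) complement to this approach.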

FAQs

Q: What is the purpose of robots.txt?
A: Robots.txt is the file behind the Robots Exclusion Protocol; it tells crawlers which parts of a site not to crawl as they index the web.

Q: How does OpenAI honor robots.txt files?
A: OpenAI says it honors robots.txt files when they are configured with its own set of do-not-crawl tags, but it warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.

Q: What is the impact of the attack on Triplegangers?
A: The attack knocked Triplegangers offline and is expected to result in a jacked-up AWS bill due to the CPU and downloading activity from the bot.

Q: What is the solution to the problem?
A: The solution is for OpenAI and other AI companies to ask for permission instead of scraping data, and for small businesses to be aware of the risks and take steps to protect themselves.

Q: What is the scale of the problem?
A: The scale of the problem is massive, with an 86% increase in "general invalid traffic" in 2024, according to new research from digital advertising company DoubleVerify.
