AI Web-Crawling Bots: The Cockroaches of the Internet
The Problem
While any website might be targeted by bad crawler behavior, open source developers are disproportionately impacted, writes Niccolò Venerandi, a developer of the Plasma Linux desktop and owner of the blog LibreNews. This is because open source projects share more of their infrastructure publicly and have fewer resources than commercial products.
The Issue
The root of the problem is that many AI bots do not honor the Robots Exclusion Protocol (robots.txt), a file that tells bots which parts of a site not to crawl. The protocol was originally created for search engine bots, and compliance is entirely voluntary. As a result, many AI bots ignore the robots.txt file and continue to crawl and scrape websites, often causing outages and degraded performance.
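For reference, robots.txt is just a plain-text file served at the site root; honoring it is up to the crawler. A minimal sketch (GPTBot is OpenAI's crawler user agent, used here only as an example of a bot being singled out):

```text
# robots.txt, served at https://example.com/robots.txt
# Compliant crawlers fetch this before crawling anything else.

# Tell this specific AI crawler to stay out entirely:
User-agent: GPTBot
Disallow: /

# All other bots may crawl everything:
User-agent: *
Disallow:
```

The article's point is precisely that rules like these are advisory: a crawler that ignores the file faces no technical barrier, which is why developers have turned to active countermeasures.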
Fighting Back
In response, some developers have started fighting back in ingenious and often humorous ways. One example is Xe Iaso, a FOSS developer who built a tool called Anubis to block AI crawler bots. Anubis is a reverse-proxy proof-of-work check that must be passed before requests are allowed to hit a Git server. It blocks bots but lets through browsers operated by humans.
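Anubis runs its check in the visitor's browser; the general shape of a proof-of-work gate can be sketched in Python. This is an illustration only, with hypothetical function names, not Anubis's actual code or API: the client must burn CPU finding a nonce, while the server verifies it with a single hash.

```python
import hashlib
import itertools

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce so that SHA-256(challenge + nonce)
    starts with `difficulty` hex zeros. Cheap for one human's browser,
    expensive for a crawler making thousands of requests."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash check, regardless of how much
    work the solver had to do."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the point: verification costs one hash, while solving costs many, so the toll scales with request volume and mostly taxes high-volume scrapers.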
Enter the God of Graves
The funny part: Anubis is the name of the god in Egyptian mythology who leads the dead to judgment. The project has spread like wildfire through the FOSS community: Iaso shared it on GitHub on March 19, and within just a few days it collected 2,000 stars, 20 contributors, and 39 forks.
Vengeance as Defense
The instant popularity of Anubis shows that Iaso's pain is not unique. In fact, Venerandi shared story after story of other developers experiencing similar issues. Drew DeVault, founder and CEO of SourceHut, described spending "from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale."
Conclusion
AI web-crawling bots are indeed the cockroaches of the internet, and developers are fighting back with cleverness and a touch of humor. While a complete fix may not be possible, these creative defenses provide both practical protection and a sense of justice.
FAQs
Q: What is the Robots Exclusion Protocol (robots.txt) file?
A: The robots.txt file implements the Robots Exclusion Protocol, a standard that tells crawlers which parts of a website they should not visit. Compliance is voluntary.
Q: What is Anubis?
A: Anubis is a reverse proxy proof-of-work check that must be passed before requests are allowed to hit a Git server. It blocks bots but lets through browsers operated by humans.
Q: What is the goal of the Nepenthes tool?
A: The goal of the Nepenthes tool is to trap crawlers in an endless maze of fake content, wasting their time and resources.
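Nepenthes' internals aren't described here, but the "endless maze" idea can be sketched: deterministically generate fake pages whose links lead only to more fake pages, so a crawler that follows them never escapes. A toy Python illustration (hypothetical, not Nepenthes' actual code):

```python
import hashlib
import random

def maze_page(path: str, links_per_page: int = 5) -> str:
    """Generate a fake HTML page for `path` whose links point only at
    deeper fake paths. Seeding the RNG from the path makes each page
    stable across requests, so the maze looks like a real site."""
    seed = hashlib.sha256(path.encode()).digest()
    rng = random.Random(seed)
    links = [
        f"{path.rstrip('/')}/{rng.randrange(10**6)}"
        for _ in range(links_per_page)
    ]
    body = "".join(f'<a href="{link}">{link}</a>\n' for link in links)
    return f"<html><body>\n{body}</body></html>"
```

A server would route unknown paths to a generator like this; every page a misbehaving crawler fetches yields more dead-end links, consuming its crawl budget instead of the site's real resources.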
Q: What is the purpose of the AI Labyrinth tool?
A: The AI Labyrinth tool is intended to slow down, confuse, and waste the resources of AI crawlers and other bots that don't respect "no crawl" directives.

