OpenAI’s Bot and the Unintended DDoS: A Case of Web Scraping Gone Awry
On Saturday, Triplegangers CEO Oleksandr Tomchuk was alarmed to find that his company's ecommerce site had gone down in what looked like a distributed denial-of-service (DDoS) attack. The cause? A relentless bot from OpenAI attempting to scrape the entirety of the company's vast website, which hosts more than 65,000 products, each with its own detailed page and multiple photos.
"OpenAI's bot made tens of thousands of server requests, aiming to download hundreds of thousands of photos with their detailed descriptions," Tomchuk explained. Surprisingly, the bot used around 600 different IP addresses, and possibly more, to probe his site. "Their crawlers were crushing our site," Tomchuk said, likening it to a DDoS attack.
Vulnerabilities in Web Scraping Detection
Triplegangers is a business built on a decade of work, creating what it claims is the largest database of 3D human digital doubles. These files are crucial for 3D artists and game developers who require realistic human models. Despite having a terms of service policy forbidding unauthorized bot access, OpenAI's bot infiltration exposed a critical gap: the absence of a properly configured robots.txt file.
A robots.txt file, part of the Robots Exclusion Protocol, tells crawlers which parts of a site they should not visit. The protocol, however, depends on AI companies voluntarily honoring those rules, and it can take up to 24 hours for bots to recognize an updated robots.txt file. Without one, AI scrapers tend to treat the absence of instructions as a free pass to gather whatever data they please.
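To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler is supposed to consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The domain and URL are placeholders, and nothing forces a scraper to run a check like this; compliance is entirely voluntary.

```python
from urllib.robotparser import RobotFileParser

# Load the site's robots.txt (example.com is a placeholder domain).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# A compliant crawler checks its own user-agent token before each fetch.
url = "https://example.com/products/12345"
if parser.can_fetch("GPTBot", url):
    print(f"Allowed to fetch {url}")
else:
    print(f"robots.txt disallows fetching {url}")
```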
For Triplegangers, the incident was more than an interruption; it had financial consequences. The bot's activity during peak US business hours not only brought the site down but also drove up the company's AWS bill through the extra bandwidth and CPU usage.
Safeguarding Against Unwanted Scraping
By Wednesday, Tomchuk's team had implemented a more robust defense, including a properly configured robots.txt file and a Cloudflare account to block OpenAI's GPTBot, alongside other bots such as Barkrowler and Bytespider. However, the company has little ability to identify, let alone remove, any content that was already downloaded. So far, OpenAI has not responded to inquiries, nor has it delivered the opt-out tool it has promised, which would ostensibly help block unwanted data scraping.
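As a rough illustration, a robots.txt along the following lines asks those crawlers to stay away. GPTBot is the user-agent token OpenAI documents for its crawler; the tokens for the other two bots are assumptions here and should be checked against each operator's documentation. A separate Cloudflare block still matters, because robots.txt is only a request, not an enforcement mechanism.

```
# Disallow OpenAI's documented crawler
User-agent: GPTBot
Disallow: /

# Other crawlers mentioned above (user-agent tokens assumed)
User-agent: Barkrowler
Disallow: /

User-agent: Bytespider
Disallow: /
```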
This predicament is especially concerning for Triplegangers because its catalog is built from scans of real people, whose rights over their own images are protected by laws such as Europe's GDPR, underscoring the seriousness of unauthorized data extraction.
A Growing Industry-wide Issue
The issue of aggressive web scraping is not isolated. Other websites have faced similar problems: digital advertising research from DoubleVerify found an 86% rise in invalid, non-user-generated web traffic attributable to AI crawlers and scrapers in 2024. This surge often goes unnoticed until a site runs into problems like those Triplegangers experienced.
Tomchuk's experience serves as a cautionary tale for small online businesses, emphasizing the need to proactively monitor web logs for signs of AI bot activity. "The model operates similarly to a mafia shakedown," he noted, criticizing AI companies for not seeking permission before scraping data.
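For site owners who want to follow that advice, a minimal sketch like the one below tallies requests per suspected AI crawler from a web server access log. The log path, the log format (user agent as the last quoted field, as in the common "combined" format), and the list of bot names are assumptions to adapt to your own server.

```python
import re
from collections import Counter
from pathlib import Path

# User-agent substrings associated with the crawlers named in this article
# (illustrative list; extend it with whatever bots appear in your logs).
AI_BOTS = ["GPTBot", "Bytespider", "Barkrowler"]

# In the combined log format, the user agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def tally_bot_hits(log_path: str) -> Counter:
    """Count requests per suspected AI-bot user agent in an access log."""
    hits = Counter()
    for line in Path(log_path).read_text(errors="replace").splitlines():
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[bot] += 1
    return hits

if __name__ == "__main__":
    # "access.log" is a placeholder; point it at your server's actual log file.
    for bot, count in tally_bot_hits("access.log").most_common():
        print(f"{bot}: {count} requests")
```

A sudden spike in any of these counts, especially outside normal traffic patterns, is the kind of early warning that could have flagged the crawl before it took the site down.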