Photo by Levart_Photographer on Unsplash

OpenAI and Anthropic Disregard Web Scraping Rules for Bots

June 24, 2024

Leading AI startups OpenAI and Anthropic are disregarding protocols designed to prevent them from scraping web content to train their models without compensating publishers.

OpenAI, known for the widely used chatbot ChatGPT, has Microsoft as its primary investor, while Anthropic, creator of the popular chatbot Claude, is mainly backed by Amazon.

An analyst at TollBit, a startup aiming to broker paid licensing deals between publishers and AI firms, along with another person familiar with the matter, told Business Insider that both OpenAI and Anthropic have been circumventing an established web protocol, robots.txt, which is designed to prevent automated scraping of websites.

On Friday, TollBit sent a letter to several prominent publishers, alerting them that numerous AI firms were engaging in similar practices. The letter did not name any of the AI companies involved.

However, last week, Perplexity, a firm that describes itself as “a free AI search engine,” faced public scrutiny after Forbes accused it of plagiarizing and distributing its content without authorization across multiple platforms. Wired, in turn, reported that Perplexity had been scraping content from its website and from other Condé Nast publications in disregard of the robots.txt protocol.

Despite OpenAI and Anthropic publicly stating that their respective web crawlers, GPTBot and ClaudeBot, honor robots.txt and publisher blocks, TollBit’s findings suggest they have not been true to their word: both companies are reportedly opting to “bypass” robots.txt and scrape a website’s content in full.

Neither OpenAI nor Anthropic has commented on the matter, though in May, OpenAI wrote in a blog post on its website that it takes web crawler permissions “into account each time we train a new model.”

Since its introduction in the late 1990s, robots.txt has been a simple text file, placed at a site’s root, that lets websites instruct bot crawlers not to scrape and collect their data. It has been so widely adopted that it has become a foundation of the unwritten rules governing the web.
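For illustration, a publisher wanting to keep both crawlers named above off its site entirely, while still letting other bots index most pages, could publish a robots.txt along these lines (a minimal sketch; the /private/ path is a hypothetical example):

# Ask OpenAI’s crawler to stay out of the whole site
User-agent: GPTBot
Disallow: /

# Ask Anthropic’s crawler to stay out of the whole site
User-agent: ClaudeBot
Disallow: /

# All other crawlers may index everything except a hypothetical /private/ path
User-agent: *
Disallow: /private/

Crucially, the file is a request rather than an enforcement mechanism. A well-behaved crawler consults it before every fetch, which Python’s standard library makes straightforward (a sketch using the built-in urllib.robotparser module; example.com is a placeholder domain):

import urllib.robotparser

# Load and parse the site's robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A compliant crawler asks before each request; a non-compliant one
# simply skips this check and fetches the page anyway.
if rp.can_fetch("GPTBot", "https://example.com/some-article"):
    print("robots.txt permits fetching this page")
else:
    print("robots.txt asks this crawler to stay away")

Nothing in the protocol technically prevents a bot from ignoring the answer, which is why robots.txt has always functioned as an honor system.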

As generative AI grows rapidly, startups and tech firms are racing to build cutting-edge AI models, and high-quality data is a crucial ingredient in that race. The surging demand for training data has weakened the efficacy of robots.txt.

Last year, several tech firms argued before the U.S. Copyright Office that web content should be exempt from copyright protections when used as AI training data. OpenAI, meanwhile, has been securing agreements with publishers for access to their content. The U.S. Copyright Office is scheduled to update its guidance on AI and copyright later this year.
