OpenAI and Anthropic Disregard Web Scraping Rules for Bots
June 24, 2024
Leading AI startups OpenAI and Anthropic are disregarding protocols designed to stop them from scraping web content without compensating the publishers whose material is used to train their models.
OpenAI, known for the widely used chatbot ChatGPT, has Microsoft as its primary investor, while Anthropic, creator of the popular chatbot Claude, is mainly backed by Amazon.
An analyst at TollBit, a startup aiming to broker paid licensing deals between publishers and AI firms, along with another person familiar with the issue, told Business Insider that both OpenAI and Anthropic have been circumventing an established web protocol, the robots.txt standard, which lets websites opt out of automated scraping.
On Friday, TollBit sent a letter to certain prominent publishers alerting them to the issue, after it emerged that numerous AI firms were engaged in similar practices. The letter did not name any of the AI companies involved.
However, last week Perplexity, a firm that describes itself as “a free AI search engine,” faced public scrutiny after Forbes accused it of plagiarizing and redistributing its content without authorization across multiple platforms. In a report, Wired disclosed that Perplexity had been scraping content from Wired’s website and other publications owned by Condé Nast, disregarding the robots.txt protocol.
Despite OpenAI and Anthropic publicly committing to honor robots.txt and blocks on their respective web crawlers, GPTBot and ClaudeBot, TollBit’s findings suggest they have not kept their word. Both companies are reportedly choosing to “bypass” robots.txt in order to scrape websites’ entire content.
Neither OpenAI nor Anthropic has commented on the matter, though in May, OpenAI wrote in a blog post on its website that it takes web crawler permissions “into account each time we train a new model.”
Since its introduction in the late 1990s, robots.txt has been a simple text file through which websites instruct bot crawlers not to scrape and collect their data. It has been adopted so widely that it has become a cornerstone of the unwritten rules governing the web.
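To illustrate how the standard works in practice, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt rules, user-agent names, and URL below are illustrative examples of how a publisher might block AI crawlers such as GPTBot and ClaudeBot while allowing ordinary traffic; a compliant crawler checks these rules before fetching any page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to opt out of AI scraping.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler asks for permission before fetching each URL.
print(parser.can_fetch("GPTBot", "https://example.com/article"))      # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

The key point is that robots.txt is advisory: nothing technically stops a crawler that simply skips this check, which is exactly the behavior publishers are objecting to.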
As generative AI grows rapidly, startups and tech firms are racing to build cutting-edge AI models, and high-quality data is a crucial ingredient. The resulting demand for training data has weakened the efficacy of robots.txt.
Last year, several tech firms argued before the U.S. Copyright Office that using web content as AI training data should be exempt from copyright protections. OpenAI, meanwhile, has secured agreements with publishers to access their content. The U.S. Copyright Office is scheduled to update its guidelines on AI and copyright later this year.