AI start-up Anthropic accused of ‘egregious’ data scraping

Artificial intelligence start-up Anthropic has been accused by web publishers of aggressively scraping data from their sites to train its systems, potentially breaching their terms of service in the process.

AI developers rely on ingesting vast quantities of data drawn from a wide variety of sources to create large language models, the technology behind chatbots such as OpenAI’s ChatGPT and Anthropic’s rival, Claude.

Anthropic was founded by a group of former OpenAI researchers with a promise to develop “responsible” AI systems.

However, Matt Barrie, chief executive of Freelancer.com, accused the San Francisco-based company of being “the most aggressive scraper by far” of his portal for freelancers, which has millions of daily visits.

Other web publishers have echoed Barrie’s concerns that Anthropic is swarming their sites and ignoring their instructions to stop collecting their content to train its models.

Freelancer.com received 3.5mn visits from an Anthropic-linked web “crawler” in the space of four hours, according to data shared with the Financial Times. That makes Anthropic “probably about five times the volume of the number two” AI crawler, Barrie said.

Visits from its bot continued to increase even after Freelancer.com attempted to refuse its access requests, using standard web protocols for guiding crawlers, he added. After that, Barrie decided to block traffic from Anthropic’s internet addresses altogether.
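
One blunt way to do that, sketched below purely for illustration (this is not Freelancer.com’s actual configuration, and the IP ranges shown are reserved documentation addresses rather than Anthropic’s real ones), is to refuse a crawler’s requests at the web server, for example with nginx:

# Illustrative nginx rules: reject all requests from a crawler's known IP ranges
# 192.0.2.0/24 and 198.51.100.0/24 are placeholder documentation ranges
location / {
    deny 192.0.2.0/24;
    deny 198.51.100.0/24;
    allow all;
}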

“We had to block them because they don’t obey the rules of the internet,” Barrie said. “This is egregious scraping [which] makes the site slower for everyone operating on it and ultimately affects our revenue.”

Anthropic said it was investigating the case and that it respected publishers’ requests and aimed not to be “intrusive or disruptive”.

Scraping publicly available data from across the web is generally legal. But the practice is contentious, can breach websites’ terms of service and can be costly for site hosts.

Kyle Wiens, chief executive of iFixit.com, said his electronic repairs site received 1mn hits from Anthropic bots in the space of 24 hours. “We have a load of alarms [for high traffic], people get woken up at 3am. This set off every alarm we have,” he said.

iFixit’s terms of service prohibited the use of its data for machine learning, said Wiens. “My first message to Anthropic is: if you’re using this to train your model, that’s illegal. My second is: this is not polite internet behaviour. Crawling is an etiquette thing.”

Websites use a protocol known as robots.txt to try to keep crawlers and other web robots off portions of their sites. However, it relies on voluntary compliance.
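
For illustration, a site’s robots.txt might look like the sketch below; the “ClaudeBot” token is an assumed example user-agent, not confirmed as the string Anthropic’s crawler sends.

# Illustrative robots.txt placed at a site's root, e.g. example.com/robots.txt
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/

A crawler that honours the protocol fetches this file before requesting pages and skips the disallowed paths; one that does not can simply ignore it, which is why operators such as Freelancer.com resort to blocking traffic at the network level.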

“We respect robots.txt and our crawler respected that signal when iFixit implemented it,” said Anthropic. The company also said its crawlers respected “anti-circumvention technologies” such as CAPTCHAs, and that “our crawling should not be intrusive or disruptive. We aim for minimal disruption by being thoughtful about how quickly we crawl the same domains”.

Data scraping is not a new practice but it has ramped up dramatically in the last two years as a result of the AI arms race. That has imposed new costs on websites.

“AI crawlers have cost us a significant amount of money in bandwidth charges, and caused us to spend a large amount of time dealing with abuse,” wrote Eric Holscher, co-founder of documentation-hosting site Read the Docs, in a blog post on Thursday. “AI crawlers are acting in a way that is not respectful to the sites they are crawling, and that is going to cause a backlash against AI crawlers in general,” he added.

Anthropic has created some of the world’s most advanced chatbots — rivalling OpenAI’s ChatGPT — which can respond to an array of prompts in natural language, while positioning itself as a more ethical actor than some rivals. Anthropic’s stated purpose is “the responsible development and maintenance of advanced AI for the long-term benefit of humanity”.

As leading AI companies compete to create ever more capable and dexterous models, they are pushing deeper into untapped corners of the web, partnering with publishers or creating synthetic training data.

OpenAI has struck a number of deals in recent months with publishers and content providers including Reddit, The Atlantic and The Financial Times. Anthropic has not publicly announced similar partnerships.

“The search engines have always done a lot of scraping,” said Barrie, “but it’s gone up a whole level with training generative AI.”

iFixit’s mission “is to give information away”, said Wiens, to encourage people to repair their own devices. “We’re not opposed to them using our content to train models, we just want to be part of the conversation.”

He added: “I’m not a crusader on this topic, I’m just trying to keep a website online.”