A follow-up to how quickly this thread appeared in a generative AI database: iFixit, a DIY repair site that many here are probably familiar with, was scanned by Anthropic nearly a million times in a single day. That’s right, nearly 1,000,000. And that’s just by one of the gAI companies.
The web scraper bot for Anthropic’s AI chatbot Claude hit iFixit’s website nearly a million times in a single day, despite the repair database having terms of service provisions that state “reproducing, copying or distributing any Content, materials or design elements on the Site for any other purpose, including training a machine learning or AI model, is strictly prohibited without the express prior written permission of iFixit.”
iFixit CEO Kyle Wiens tweeted Wednesday “Hey @AnthropicAI: I get you’re hungry for data. Claude is really smart! But do you really need to hit our servers a million times in 24 hours? You’re not only taking our content without paying, you’re tying up our devops resources. Not cool.”
Sounds like time for some formal Cease and Desist letters, followed by a lawsuit.
It should be easy to compute a number for damages, since they’re paying for the servers and bandwidth. And if they deliberately ignored an official legal request to stop, that’s now malice and can probably result in a triple-damage award.
Cease and desist for what, exactly? It’s not a copyright violation, as the law currently stands. Websites respond to requests…so a million requests are fine as long as they’re from different people?
Repeatedly downloading an entire server for the purpose of AI training is a violation of both the letter and spirit of iFixit’s terms of service. And the companies running the crawler bots know this.
So, wait, web sites now get to decide what I access their sites for? Just reading is okay, but I shouldn’t use it for “training”? If I watch a YouTube video a couple of times* about changing the oil in my car, is that training?
If you are watching a YouTube video about changing the oil in your car, that’s informative personal use. It’s why the video was created and published.
If a machine downloads it (whether once or a million times) for the purpose of scraping its content and presenting it as its own, that’s copyright infringement A.K.A. stealing.
As mentioned upthread,
The web scraper bot for Anthropic’s AI chatbot Claude hit iFixit’s website nearly a million times in a single day, despite the repair database having terms of service provisions that state “reproducing, copying or distributing any Content, materials or design elements on the Site for any other purpose, including training a machine learning or AI model, is strictly prohibited without the express prior written permission of iFixit.”
If iFixit asserts that right without defending it, then Anthropic is basically appropriating it with impunity.
There have been a few court cases on this exact topic (mostly suing OpenAI and Perplexity), but as far as I’m aware, all of them remain unresolved so far.
That’s not quite what the AI is doing, but in any case, courts have tended not to agree so far with your interpretation. Techdirt has a good analysis of the latest dismissal here.
Yeah, copyright seems like a stretch as a legal defense. More open to interpretation are the terms of service. Is there any expectation that they are read and agreed to? ToS for websites feel even squidgier than software shrinkwrap agreements.
But imposing a massive load on someone’s server may run afoul of cybercrime laws. One could easily interpret it as a kind of DoS attack, especially if a lot of companies all try to download the entire site simultaneously.
If a million people hit up your web site at once, that’s success. If one bot hits it with the force of a million users, that’s abuse.
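For a sense of scale, here is a back-of-the-envelope sketch of what a million hits in 24 hours works out to as an average rate. It assumes the requests were spread evenly across the day, which the reporting doesn't say; real crawler traffic tends to be burstier, so peaks could be well above this average. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def requests_per_second(requests: int, window_seconds: int = SECONDS_PER_DAY) -> float:
    """Average request rate over a window, ASSUMING evenly spread traffic."""
    return requests / window_seconds


# "Nearly a million times in a single day", rounded up to 1,000,000.
bot_rate = requests_per_second(1_000_000)
print(f"{bot_rate:.1f} requests/second on average")  # roughly 11.6
```

So that is on the order of a dozen requests every second, around the clock, from a single crawler.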
I agree it seems like a stretch (and appreciate the link), though it does sound like iFixit believes their copyrighted material is being appropriated and published without their permission for the benefit of Anthropic.
The problem with ToS is how often we as users ignore them (like, almost always). I’ve taken the time to read a couple of ToS agreements that in the physical world would be 30 pages of 6pt type. That’s a couple of hours in my life that I’ll never recover, despite learning just how completely the creator’s zeal to protect themselves amounts to an attack on my desire to own/lease/license/use their product.
The worst overreach, in my opinion, is the implied consent statement that “By using this [product] you agree to our Terms of Service”…and incorporating the ToS into the site itself. Just by following the link you’ve already agreed without reading, yes?
I think @Shamino has a workable framework, but what’s the line amongst (for example) (1) a healthy user load, (2) a misconfigured server hitting another server 500,000 times, or (3) a bot hitting that same server the same number of times? Has case 2 committed a crime of negligence that should require payment for damages, or is it just the nature of internetworked servers? And with the bot, is scraping for AI so your service can return aggregated parsed results any different from what search engines have been doing for nearly 30 years?
A search engine provides a link to the data’s source. Much like a footnote in a paper.
An AI just presents the data, usually without attribution. What is called plagiarism when humans do it.
If someone designed an AI that could actually understand the data and synthesize an original response based on that understanding, it would be what human researchers do all the time.
But no AI has anything like that. Instead, they parrot back excerpts (maybe short, maybe long) of their training data, and usually without attribution. Which is why there is justification for a copyright violation claim.
But I would argue that it doesn’t matter what the bot is doing with the data. Downloading a web site for AI training or for spamming or as a DoS attack makes no difference to the owner of that web site. All he sees is a massive server load with no users actually viewing anything (users who might see ads or buy merch or otherwise support the site).
And yes, a search engine does the same thing, but at least there, the engine may direct additional (hopefully human) traffic to the site, so there is at least something received in return for the server load.
How far away are we from another Douglas Adams plot coming true:
" The editor, having to meet a publishing deadline, hastily added some footnotes to his copied glossary, in order to avoid prosecution under the Galactic Copyright Laws. A later editor of the Guide sent the book backwards in time in order to sue the company behind Starbix cereal for infringement of these Galactic Copyright Laws."
Yes. I do think it’s different. I recall cases when I was maintaining web sites where 11 or 12 search engine bots would crawl through my sites once or twice a day. It really was a negligible load from my perspective, and it was helpful in surfacing my content to the wider internet, attributed to my sites, and linked back.
The behavior with Anthropic is entirely different in my opinion, because their bot is mining material, certainly using it to train their LLM, and parroting advice that iFixit invested time, money, and people in developing. Without attribution, mind you.
Maybe, as @ace implies, there is no likely legal remedy for that behavior. Maybe it’s better to see it as @Will_B presented, misbehavior that deserves to be met with infinite links.
“The worst overreach, in my opinion, is the implied consent statement that ‘By using this [product] you agree to our Terms of Service’…and incorporating the ToS into the site itself. Just by following the link you’ve already agreed without reading, yes?”
Or take the well-publicized cases of Disney and Uber trying to shield themselves from lawsuits because someone once used an unrelated service that was vaguely linked to their organization, such as watching a Disney movie or ordering a pizza to be delivered, where the ToS for that service prevented them from suing the primary organization. That’s a reprehensible use of ToS, and it has made me take a second look at a couple of apps that I’ve declined to use.