Why AI Web Scraping (Mostly) Doesn’t Bother Me

Originally published at: Why AI Web Scraping (Mostly) Doesn’t Bother Me - TidBITS

Despite the seemingly universal outrage about tech companies scraping the open Web to train their models, Adam Engst finds himself largely unperturbed.

8 Likes

An excellent and nuanced take on this subject that led me to examine my reactions a little more closely.

I do think there are reasons to remain vigilant and concerned. The now infamous “eat a rock a day” and “put glue in your pizza sauce” advice should give us pause because the LLM chatbots returned results that were obviously derived from a single source, yet presented those results as though they were some kind of consensus. I think that should be concerning both to authors and consumers of content.

The advent of mass-market printing, newspapers, radio, television, blogs, and social media giants all presented both opportunities and new challenges to writers and readers alike. One of our favorite throw-away lines around my house is “and it was on the internet, so I know it’s true.”

We’ve had to adjust, over and over, to the notion that the content we consume is crafted and delivered in ways that serve the desires of many different actors, and to consider all such media critically. Someone who contents themselves solely with AI-generated content will have an impoverished existence, just as would someone who consumed nothing but commercial television or even published books.

3 Likes

A thoughtful piece! Thanks for writing it!

1 Like

Excellent article, Adam! I fully agree. I long ago (back in the 1990s) concluded that any content I didn’t want in the de facto “public domain” had to remain behind a paywall. That doesn’t mean I don’t still own the copyright, but I can’t really complain if someone uses my content without credit or steals bits of it.

I do find AI scraping suspect if the training text is behind a paywall (like the NYT, which requires a subscription). To me, that content shouldn’t be used for training. If AI robots are doing that, the owners should be sued and lose.

Tech blogger Manton Reece posted an interesting take on this today which expressed a lot of my own ideas and questions on training. For example, humans learn by reading, so why shouldn’t robots?

Manton Reece - Training C-3PO

And humans sometimes accidentally plagiarize. Who hasn’t written a paper in school where you discovered you’d quoted a line or two from an encyclopedia verbatim without meaning to do that?

Lots of fascinating questions.

1 Like

This is another variation of asking for forgiveness rather than permission. And in almost all cases, the former is lauded as ‘easier’ or even ‘better’. I am a regular user of iNaturalist, which is a very sophisticated database of living things on the planet. It has an excellent AI that is supplemented by many human experts. I freely add data to this, and I put in quite a bit of work to do so. But that particular project is one where you give your explicit permission to participate. There are also ways to obscure sensitive content, like the exact location or even the identifier, if that is what is best for you or for the endangered organism. Wouldn’t you rather have that model?

Doesn’t it depend on the time scale and what is being done with the information, especially if it is being sold back to the author? A case in point is the current row between Facebook and Australian publishers, in which the former is refusing to pay for news stories generated by the latter but “scraped” and “sold” to Facebook subscribers. The effect is that Australian journalists are being laid off (30 staff from 9News were threatened this week because the company is losing the fees it normally collects).
Part of the equation seems to be the time scale: nobody minds if the news is collected to eventually end up as history in a public medium, but if its IMMEDIACY is being sold without attribution or compensation, we have an issue. Another element is the reach of the law: anyone who “steals” news from a US source can easily be pursued there, but an international pursuit is expensive, if not impossible.

Your last paragraph gave me a chuckle. It also made me think of the distinctions we fail to make between historical knowledge and current needs. Much of the TidBITS archive is about tech that is now historic, rather than tech for current use. History has its uses, sure, but what we need to know about our current tech takes undeniable precedence. Will AI develop into something that can install, check, and troubleshoot the tech we use today in any meaningful way? Perhaps, but that still seems yet to come.

Meanwhile, as a detailed repository of historic value, the archive can serve a purpose that could help inform later decisions positively, perhaps even more so when summarized by AI and subsequently interpreted and understood by natural intelligence. So I had to chuckle at the idea of a TidBITS chatbot telling tales of the horrors of System 7 to schoolchildren around an electric campfire in the near future!

A thing I have long found annoying is publishers who deliberately make their content available to scrapers so that it shows up in, say, search results, but then throw up a paywall when I try to access it as an end user. That’s a bait-and-switch tactic (“come to my site, here’s the information you want!” then “oh, now that you’re here, you can’t see it until you pay for it”) and wastes my time with what is basically an advertising gambit. So if paywalled sites got scraped by AI trainers because they were trying to dupe readers into paying for their content, I’ll not shed a tear if their deception opened the door for others to use their content in ways they might not want.

4 Likes

One aspect you don’t mention, Adam, is the wholesale theft by LLMs of content that is not publicly available without payment.

I have written three popular non-fiction books (on meeting design). Copies of these books have been made available on pirate websites for download. ChatGPT has scraped pirate copies of my books without paying me a penny. I know this because I can ask ChatGPT to summarize a chapter that has never been made publicly available and it will provide a detailed summary that could only have been made if the book had been added to their database. OpenAI refuses to share the sources of its training data, but my experience shows that the company has no problem using stolen paid content. I am just one of thousands of authors whose hard work has been appropriated by OpenAI.

I have also published 800+ blog posts over the last 14 years, and ChatGPT has those too. Like you, I have no problem with this except that ChatGPT’s reformulation of them contains so many inaccuracies and distortions that I have to conclude that anyone who is looking for reliable information on a topic, at least in my areas of expertise, is ill-served by anything that ChatGPT spits out.

4 Likes

This is an excellent example of why it’s naive to brush away the lack of source declaration (or to claim it has more to do with “manners” than business).

Imagine if an LLM provider like ChatGPT faced regulation that said (a) every answer has to specify its source(s), and (b) you could then require ChatGPT to scrub its models’ training data of material derived from your book, since the source attribution identifies the material as illegally obtained.

And before the usual cop-outs about how this can’t be done, let me add this. Right now on our very own campus we’re testing an experimental LLM toolkit by Google that aims at summarizing and synthesizing large bodies of scientific content. Not only will it tell me the exact sources for all its answers, it also allows removing a specific source.
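To make that less abstract, here’s a minimal sketch in Python of how source-level attribution and removal can work in a retrieval-style index. Everything here is invented for illustration; it says nothing about how Google’s toolkit (or any real product) is built:

```python
# Hypothetical sketch only: a tiny retrieval index where every chunk of
# text carries its provenance. Nothing here reflects any real product's
# internals; the names are invented for illustration.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # e.g., a URL, DOI, or ISBN identifying where the text came from


class AttributedIndex:
    def __init__(self) -> None:
        self.chunks: list[Chunk] = []

    def add(self, text: str, source: str) -> None:
        self.chunks.append(Chunk(text, source))

    def answer(self, query: str) -> tuple[list[str], set[str]]:
        """Return matching passages plus the exact sources they came from."""
        hits = [c for c in self.chunks if query.lower() in c.text.lower()]
        return [c.text for c in hits], {c.source for c in hits}

    def remove_source(self, source: str) -> None:
        """Scrub everything derived from one source, e.g., a pirated book."""
        self.chunks = [c for c in self.chunks if c.source != source]


index = AttributedIndex()
index.add("Meetings work best when attendees set the agenda.", "isbn:978-0-0000000-0-0")
passages, sources = index.answer("agenda")     # every answer comes with its sources
index.remove_source("isbn:978-0-0000000-0-0")  # and a source can be scrubbed on demand
```

Note the asymmetry, though: scrubbing a source from a retrieval index like this is trivial, whereas removing its influence from a trained model’s weights (“machine unlearning”) remains genuinely hard, which is presumably part of why providers resist.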

1 Like

I agree, but I’ll also point out that many “paywalls” are pathetic and are implemented in a way that may not affect a scraper-bot, even if it doesn’t have deliberate circumvention code.

For example, the New York Times allows a small number (3 or 5, I think) of free views before they ask you to log in. They are almost certainly using cookies to track your visits, since anything else would be useless without a login. Scraper software probably doesn’t support cookies at all, or if it does, it probably discards them between fetches, because they would just be unnecessary overhead.

And I’ve seen plenty of “paywalls” where the entire article is downloaded, and the content is then covered with a giant popup asking you to pay. Putting a browser into “reader” mode or disabling scripting is all you need to do to “circumvent” this mechanism. And again, I would not expect a scraper-bot to run scripts unless strictly necessary for retrieving content, and even then, it will probably be configured to run only those absolutely necessary, simply because it would be an unnecessary waste of CPU cycles. And if the content is already downloaded, then the bot has it, whether or not a script might later cover it with a popup.

Actual paywalls, where nobody gets content (or no more than a few lines of teaser content) without a login do exist, but they are far less common than the useless window dressing scripts that I usually see when browsing.
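To make the mechanism concrete, here’s a hedged sketch (Python, with a placeholder URL) of why a metered or overlay “paywall” does nothing against a bare fetcher: the client keeps no cookies, so there’s nothing to meter visits with, and it never runs JavaScript, so the popup that would cover the article never appears.

```python
# Hedged sketch only: demonstrates why an overlay "paywall" doesn't stop a
# simple fetcher. The article text is in the initial HTML; the popup that
# hides it is added later by JavaScript, which this client never runs.
# The URL is a placeholder, not a real paywalled site.
import urllib.request
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> tags, ignoring <script> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.in_p = False
        self.in_script = False
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_p and not self.in_script:
            self.paragraphs.append(data.strip())


# A bare request: no cookie jar, so there is nothing to count visits with,
# and no JavaScript engine, so the overlay never appears.
html = urllib.request.urlopen("https://example.com/article").read().decode("utf-8", "replace")
parser = ParagraphExtractor()
parser.feed(html)
print("\n\n".join(p for p in parser.paragraphs if p))
```

Nothing here is circumvention in any meaningful sense; the publisher handed over the full text in the very first response.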

Absolutely true. The difference is that your third-grader’s homework assignment isn’t published as an original research paper. And by the time he grows up to the point where he may be publishing, he has almost certainly been taught about the laws and ethics of plagiarism. (He might still plagiarize, but almost certainly not by mistake.)

Ow. Yes. This is another vector that hasn’t been discussed so far.

These training-scrapers may need code to block pirated content rather than just blindly following all links wherever they may go (as a search engine probably should do).

And not just for copyright reasons. Also because pirated content is often corrupted (e.g. software with malware installed) and in ways that a bot probably can’t detect. You really don’t want it producing malicious answers because it trained itself from malicious source material.
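A minimal sketch of what such blocking might look like, assuming the operator maintains a denylist of known pirate domains (the domains and URLs below are placeholders, not anything a real crawler uses):

```python
# Hypothetical sketch: filtering a crawl frontier against a denylist of
# domains known to host pirated content. Domains and URLs below are
# placeholders, not anything a real crawler actually uses.
from urllib.parse import urlparse

PIRATE_DOMAINS = {"pirated-books.example", "warez.example"}  # placeholder denylist


def allowed(url: str) -> bool:
    """Reject a URL whose host is a denylisted domain or any subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in PIRATE_DOMAINS)


frontier = [
    "https://legit-publisher.example/article",
    "https://downloads.pirated-books.example/book.epub",
]
to_crawl = [u for u in frontier if allowed(u)]  # drops the pirate link
```

Maintaining such a denylist would be its own game of whack-a-mole, of course, much like the DMCA takedowns described later in this thread.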

InvokeAI just did an interview with a copyright lawyer and it’s really relevant to this discussion.

I’m still listening to the video, but this is the most powerful statement so far regarding fair use and what I think people are actually upset about: “You’re not just producing information about these works [of art, music, and writing], you’re using these works to create a substitute for these works.” In other words, AI models are taking data that provides income or some other value to its original author, and they are creating copies good enough to supplant the originals, so the original authors no longer get value from their creative work.

Personally, I think the issue is bigger though, bigger than AI, and that is that what people consider valuable is changing dramatically. In this video, Rick Beato lays out an argument that music is no longer valuable, and not because of AI.

The line between fact and fiction, reality and parody, was so blurred so long ago that I couldn’t get too exercised about those examples. The glue one was at least unprovoked, but when you ask how many rocks you should eat, it’s easy to see getting a snarky response back. It just keeps coming down to having to evaluate everything you read.

Fascinating take from Manton Reece, indeed!

It’s certainly one approach, but I’m not sure it’s one that makes sense outside its specific application. Text doesn’t just exist on its own in the natural world—people have to create it. And in this context, they have to make it available for others to read, which feels like giving permission to me.

I haven’t been following this closely, not being from Australia and being generally allergic to anything associated with Facebook. But based on some quick searches, it sounds like Facebook was only posting headlines with links to the original articles. I realize there were some payments from Meta to content companies, but wouldn’t that generally fall under the implied contract of “you can scrape my content if you send me traffic”?

It does feel different, but Perplexity does cite its sources and link back to them, being more of a search engine than a chatbot, and while I don’t have the exact dates on hand, the chatbots are generally a year or so out of date. So that doesn’t seem like a huge issue at the moment.

An age-old Internet issue!

That is an issue, though it would seem that the fault lies more with the pirate websites than with ChatGPT’s bot, which would have no way of distinguishing a legitimately uploaded book (by the author, or in the public domain) from one like yours that was stolen.

I actually thought about mentioning this because back when we were publishing Take Control, I fought with the pirate websites constantly. From 2010 through 2015 (mostly through 2012), I sent 287 DMCA takedown notices. They nominally worked, but it was complete whack-a-mole. Some of the sites even told me they didn’t actually have the files; they just generated a faux download link based on searches. As annoying as this was, I never had the impression that it actually cut into sales in a real way.

In our case, the content also went out of date fairly quickly, so even if those books were used for training, they wouldn’t have been revealing anything that was still for sale. It sounds like your books might be more evergreen.

The real question, though, is whether the fact of your books being in the training model actually affects your sales. Do you have any evidence that people are using ChatGPT and getting your content in such a way that they don’t buy your book? (Is it even possible to know that?) I’m not asking to be annoying; these were the sort of things I tried to figure out for Take Control whenever something I didn’t like happened. In that situation, the answer was always that the effect was minimal and that we had to focus on serving our most loyal readers.

See my points about chatbot responses being C+ work… 🙂 I’m curious how you prompted ChatGPT to reformulate them and got distortions and inaccuracies. I tried it with my big article on Dark Mode and found that ChatGPT conflated the article text with the comments, thus somewhat confusing the matter until I asked it to separate the two. Interestingly, Perplexity did not make that mistake. Both ChatGPT and Perplexity properly linked to the original article.

But it still feels like an artificial example. If I were an average person wondering about Dark Mode, I wouldn’t be asking about specific articles, and when I asked more generally, I got a pretty balanced view, and ChatGPT was happy to go into more depth on the criticisms.

I always wonder what counts as a substitute, especially one that can compete commercially. And you’re absolutely right about the issue of what people consider to be valuable.

Another pre-AI story. When Apple opened the iBookstore, we published Joe Kissell’s Take Control of iCloud there, and it was a huge hit at $14.99 for several hundred pages. 1500 copies sold in the first month, I seem to remember. Shortly afterward, however, someone published a book about iCloud that was much, much shorter, but also much cheaper—perhaps $1.99. I can’t remember (or find) all the details, but iBookstore sales for Take Control of iCloud tanked and never recovered. The problem was purely price—just as with the App Store, Apple created a marketplace that drove prices to effectively zero, and price was more important to buyers than quality that they presumably couldn’t determine in advance.

But the continued success of the book with the Take Control audience showed that the most important thing is to find the right audience. The iBookstore audience wasn’t right for us.

1 Like

Not quite on topic.

And I appreciate it, especially the part that enriches me. Thank you.

Obviously, the pirate websites are the first offenders. But OpenAI made the unethical choice of scraping content from them! They didn’t need to make that shameful choice. At a minimum, OpenAI could have purchased a single copy of each book, as libraries do, giving them at least a (wobbly) leg to stand on.

I don’t see how I (or OpenAI) would be able to determine that ChatGPT users obtained content derived from my books, directly leading to lost sales. So I’ll never know. By its actions, OpenAI is essentially saying, “You can’t prove that any harm was done to you by our stealing your content, so it was OK to steal it.”

It’s easy. Just now, I asked ChatGPT: “Tell me ten interesting things about meeting design from the writings of Adrian Segar”. Two of the ten responses are nonsense. A novice won’t know that two are fantasy and eight are decent information.

When I asked ChatGPT: “Tell me ten fun facts about Adrian Segar”, one of its responses was that I was an expert juggler. Who knows how it made that up. I can’t even keep three balls in the air!

In general, if you want to see how deceptive ChatGPT’s answers can be, ask it for information about something you know a lot about. In my experience, garbage will be mixed in.

1 Like

But that’s precisely the point. The masses cheering on ChatGPT et al. are people who have seen it give nice-looking responses to questions about matters they are entirely clueless about, and hence are easily impressed by trustworthy- and confident-sounding answers that come in fancy bullet lists (heck, even people on this board have started emulating that style). Those of us who are deeply knowledgeable about a certain topic and have seen the kind of garbage that some of these tools render on those topics (just like social media before) are well aware of the misinformation, the poor signal-to-noise ratio, and the ruse. But fat chance having some boring old librarian-type expert get the plebs lusting for the “next big thing” to take a more critical stance. I’d be tempted to say, let them run off that cliff, except I’m then quickly reminded that they’re not just sealing their fate, they’re sealing mine too, just as the Jan 6 coup attempt tried to destroy the oldest democracy not just for FB drones and those who spend half their lives on “social media”, but for everybody. Still, what else can you do but point out at every step that we need a far more critical stance than just crossing our fingers and toes and hoping we’ll reap only the good?

1 Like

Ultimately, the takeaway is what I’ve been saying for years: The biggest danger from AI is the fact that people trust it.

1 Like

Really liked your article, Adam, especially the section “It’s called the open Web for a reason.” If you put something on the open web, you can hope that people will use the material in a respectful way, knowing that some people won’t, and that many people’s understanding of what respectful means will differ from yours anyway. I agree: if you’re not comfortable with that arrangement, you shouldn’t publish on the open web.

I think that, in addition to trusting the output of AI products, anthropomorphizing AI products is another problem. Many people tend to project human qualities (e.g., cleverness, mannerisms) onto such tools, which is understandable, especially since they mimic human responses so well; companies have played to this, labelling them as intelligent assistants, etc. While that makes AI tools more accessible, it increases the risk of using fallacious analogies, exaggerating AI tools’ capabilities and dangers, and generally misunderstanding AI tools.

1 Like