Why AI Web Scraping (Mostly) Doesn’t Bother Me

Halfsmoke · July 10, 2024, 4:07pm

As a message board participant on TB and on other public sites (as opposed to a professional content creator), my main concerns about AI scraping are:

A third party (parties?) is archiving my posts and is using my thoughts and words to generate derivative content without giving me any ability to opt in or opt out.
How else are our posts being used by organizations without our knowledge? Not all AIs are used to simulate human writing like Google Bard or Microsoft ChatGPT. For example, what if a hacker group is training an identity theft AI? Or a state-sponsored intelligence agency is building an AI whose purpose is to find and track citizens who are living abroad?

I’ve always assumed that anything I post on Internet message boards would be crawled by search engines and made available in searches. But for me, having entire discussion forums hoovered up, archived, and repeatedly used to train black-box AI processes goes way beyond that.

ace · July 11, 2024, 4:03pm

When I dealt with these sites during the Take Control days, it wasn’t easy to say they were obviously pirate sites. More generally, they were “file sharing” sites that allowed anything to be uploaded and then responded to DMCA takedowns for copyrighted content. So lots of what they held was probably legitimate, and I’m not sure there would be an easy algorithmic way to teach a crawler to avoid them. It seems obvious that most of the AI companies don’t really care, but that’s not to suggest that avoiding those sites would have been easy or even possible.

Ah, that wasn’t what I meant, but I was making some assumptions based on Take Control that probably aren’t true for you. I agree that OpenAI can’t know anything about your sales or what users are generating based on your books. When we were doing Take Control, we had complete insight into all sales channels in real time, so we could tell when sales went up or down in response to some action or environmental change. Plus, because we had direct communication with all purchasers, we could make sure that they knew certain things. I think I can count on one hand the number of times we heard from someone who had downloaded a book illegally and was confused about something, whereas legitimate purchasers wrote in all the time. (That’s across 500K sales over 14 years.) So while I can’t prove that online file sharing sites didn’t hurt sales, I just don’t think they did. In contrast, per my story about the iBookstore, I can prove that Apple driving book prices to near zero hurt sales badly.

Right. But a novice would stop at “Tell me ten interesting things about meeting design” because they won’t know you’re an expert in the field. I’m a novice (though one who runs regular meetings), and when I fed ChatGPT that prompt, what came back was entirely reasonable. Feeding it the prompt that mentioned you did indeed return much different stuff, but since I know nothing about your writings, I couldn’t evaluate its accuracy. I pushed it a bit, and while I learned a bit about fishbowl conversations, for example, I didn’t come away thinking that I knew enough to hold one, and in fact, if I were going to try one, I’d want to learn more about it in ways that didn’t require me to keep asking questions. And when I asked for books that would help, it did suggest one of yours.

This gets back to my point about generative AI chatbots doing C+ work. It’s usually not completely wrong, though it usually lacks in detail and nuance, some of which can be recovered by continuing the discussion. Not everyone wants, has access to, or can afford expert information on every topic at all times. If I’m looking for advice on dissuading bats from roosting in my soffit vents, the basics are fine—I don’t want a treatise, and it’s a lot easier to read a summary than try to figure out the agenda behind all the websites trying to sell me a solution.

Yes, that’s undoubtedly happening—there’s very little granularity with respect to specific permissions, and what does exist is only at the site level with robots.txt or authentication requirements. I could easily lock down this site so only people with accounts could read it, keeping all the bots out, but I believe that information should be shared.
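To make the site-level point concrete, here is a minimal sketch of how a robots.txt rule is evaluated, using Python's standard `urllib.robotparser`. The rules and the example.com URL are hypothetical and are not this site's actual configuration; `GPTBot` is the user-agent token OpenAI publishes for its crawler. Note that robots.txt is advisory: it only works if the crawler chooses to honor it, and it applies to paths on a site, not to an individual poster's words.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler site-wide, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/t/some-thread/1"  # hypothetical forum URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_fetch(agent, url))
```

The rule is all-or-nothing per crawler and per path. There is no way to express "index this thread for search but don't train on my posts in it," which is the missing granularity described above.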

Since it’s without our knowledge, it’s not something we can answer. But see above. If you’re worried about being tracked, you shouldn’t post on the open Web. The old saw about how you shouldn’t put anything in email that you don’t want to end up on the front page of the New York Times is even more true with posting on the open Web.

(Just FYI, Bard has been renamed to Gemini, and Microsoft licenses ChatGPT from OpenAI.)

xdev · July 11, 2024, 4:40pm

It occurs to me that all this fuss about generative AI is like a talking dog. Everyone is so amazed that the dog can talk, no one is bothering to ask if the dog actually has anything valuable to say!

Shamino · July 11, 2024, 5:13pm

On the other hand, the same could be said for file sharing networks like Gnutella and Napster - which were really just glorified search engines, providing links to content hosted by millions of individuals worldwide.

But that didn’t protect them from prosecution, resulting in a shutdown of any servers (for centralized protocols) and termination of nearly all client software projects (for decentralized protocols).

Of course, the prosecution didn’t stop anything. These file sharing networks still exist, and many protocols (like BitTorrent) are still really popular. What ended the massive free-for-all of piracy was the existence of cheap stores and streaming services, for which Apple broke the ice with the iTunes Music Store.

ace · July 11, 2024, 5:27pm

That’s the problem—the dog does have something to say when prompted appropriately. It may not be Mr. Peabody, but even the basic chatbots are useful right now, and generative AI overall has incredible promise. Exactly what the chatbots are useful for varies widely by person, which works against blanket statements.

fischej · July 11, 2024, 5:59pm

There’s a blast from our apparently shared childhood past.