GenAI bots are pushing websites into a corner that imperils open access, and perhaps worse, the web's historical record. From @gluejar:
https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html
Assuming that the web will continue to evolve instead of getting crushed underfoot, there is some interesting work going on over at the IETF on how to build on the now-aged robots.txt protocol to allow rights holders to express how their content can be used online:
My experience with @base and other web services run by Bielefeld University Library is in line with @gluejar's.
The IETF sound naive when they claim that “[r]ight now, AI vendors use a confusing array of non-standard signals in the robots.txt file (defined by RFC 9309) and elsewhere to guide their crawling and training decisions” when in reality many of them ignore whatever signals a website sends them. They even plunder the shadow libraries.
@chpietsch yes, I guess you could look at it as naive. In many ways robots.txt was naive too. But one aspect of this is that we need ways for rights holders to assert their wishes, so that courts in jurisdictions that care (e.g. the EU) can use them as evidence. And there needs to be more nuance than what robots.txt provides:
https://mailarchive.ietf.org/arch/msg/ai-control/EJ-84k8Zzh21vY1dHPZvDeYOLes/
@edsu @chpietsch Was it naïve or a brilliant way to avoid regulation? Remember “Do Not Track”? Ditto. I think naïve is thinking that the assholes in Big Tech don’t know exactly what they’re doing when they seek to avoid accountability. But hey, at this point, they have the world’s most lethal military behind them, so I guess accountability is moot.
@edsu@social.coop @chpietsch@fedifreu.de to be real, I would prefer if we DIDN'T empower massive corporations to restrict content more strictly via copyright and IP law directly. This IS going to be abused to lock down information from regular people more than it helps them.
What is truly naive is thinking that massive AI scraping operations, which already break several laws, are going to suddenly start following the rules if you make the rules good enough. In reality this is going to further lock down the internet while these massive scraping operations change nothing, keep doing crimes, and then claim everything is "fair use" in the courts, and no rights holder can complain anyway.
It seems really misguided and bad to legally strengthen a system that heavily restricts information for a group of AI scrapers who will happily commit crimes and grab more than they're allowed regardless. In the AI world, more data than your competitor literally means greater profits, and until that shifts this proposal will only harm regular people imo.