New bookmark: Why the internet needs crawl neutrality

As the maintainer of a long list of crawling search engines, this seems like a really important issue to me. Perhaps I should add info about crawlers and user agent strings to that list. “The web is hostile to upstart search engine crawlers, and most websites only allow Google’s crawler.”


@Seirdy Not that Google would go for this, but maybe instead of search neutrality we need collective crawling? One co-op crawling body with neutral access terms for that data.

And perhaps this could be coupled with a push model instead of a pull model. Websites that want to be included in search can submit new or changed URLs once a day for the crawler to pick up.

Then search engines can optimize access to their pool of data without burdening sites.

This has been my daily thonk. :thonking:

@cstanhope
> And perhaps this could be coupled with a push model instead of a pull model. Websites that want to be included in search can submit new or changed URLs once a day for the crawler to pick up.

Google supports WebSub; Bing and Yandex support IndexNow. WebSub is probably the way to go.
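
For anyone curious what "push" looks like in practice, here is a rough sketch of the two request bodies, not a definitive client. It only builds the payloads and sends nothing. The IndexNow field names (`host`, `key`, `keyLocation`, `urlList`) follow the published IndexNow protocol; the `hub.mode=publish` ping is the convention most WebSub hubs accept rather than something the W3C spec mandates. The helper names, `example.com`, and the key value are made up for illustration.

```python
import json
from urllib.parse import urlencode


def indexnow_payload(host, key, urls):
    """Build the JSON body for an IndexNow bulk URL submission.

    Assumption: the key file is hosted at the site root as <key>.txt.
    """
    return json.dumps({
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    })


def websub_publish_body(topic_url):
    """Build the form-encoded body for a WebSub publish ping.

    Assumption: the hub accepts the common hub.mode=publish convention.
    """
    return urlencode({"hub.mode": "publish", "hub.url": topic_url})


# Hypothetical site and key, for illustration only.
body = indexnow_payload("example.com", "abc123", ["https://example.com/new-post"])
ping = websub_publish_body("https://example.com/feed.xml")
```

The IndexNow body would be POSTed as JSON to a participating search engine's IndexNow endpoint, and the WebSub body would be POSTed as a form to whichever hub the feed advertises.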

@Seirdy I didn't realize PubSubHubbub lived on as WebSub. Interesting.

@cstanhope Oh yeah, WebSub is a big deal. Tons of publishing and feed-reading services use it, and it's a huge part of many Fediverse (incl. Mastodon) and IndieWeb implementations.

@Seirdy Geez. Stop following the development of something for a minute, and it is "suddenly" everywhere. 😆
