Finally wrote a post that's been stewing for a while: What You Miss By Only Checking GitHub

Many researchers, entrepreneurs, open source sustainability commentators, et al. assume that GitHub activity is a reasonable proxy for FLOSS as a whole. It's not.

harihareswara.net/posts/2022/w

Goes over some examples, a marketing graphic that made my eyebrows go up, research about how unrepresentative GitHub can be, and some tools to try.

When you come across some new headline or initiative or statistic about open source, it's worth checking: do they only mean "on GitHub"? Kind of like when I see a headline about "everybody" and they mean "in the US".

@brainwane

Discoverability is a major issue in FLOSS.

Those of us who remember "Freshmeat" are now old.

But I do think we need something akin to it, or better.

Thoughts?

@emacsen I think discoverability, in general, got easier when the web became more searchable... but there are different kinds of discoverability, e.g.,

1. finding something when you go actively seeking things like that
2. having a feed you can constantly visit so you become ambiently aware of potentially useful tools
3. getting notified/suggested when a friend, writer, speaker, or marketer wants you to know of a particular tool

@brainwane

I love when smart people disagree with me, it's such an opportunity to learn :)

I was thinking of #1, #2, and another one, which is "Where do I go for support on this?", which could be community support channels (mailing lists, IRC, Matrix, Discord, Git* issues), or paid commercial support.

@emacsen "Where do I go for support on this" is interesting partly because the answer often crosses "open source project"-specific boundaries. The example that comes to mind most easily for me is Python packaging, but any toolchain with multiple permutations of interoperable parts will also run into this. Considering....

@brainwane

The amount of friction required to get commercial support for FLOSS software is astronomical.

After decades, I finally have resources to pay people for work I want done, and I see tons of people looking for work online, and yet finding people to support tools that I want to support is challenging.

It's not even primarily a money issue.

If someone fixed this well (not just Fiverr or Upwork), I think they'd do the community a huge service and strike it rich.

(they'd also be hated)

@pixelherodev @brainwane

Maybe this is amazing software (I don't know), but it certainly makes for a challenging first impression.

The project description just says "This is the source code for [this website]." The website doesn't exist AFAICT and the installation file mentioned doesn't exist either.

> We have not been able to prioritize this work, but we have developed enough of it that we can work with potential contributors. If you’re interested in helping us bring this service to life, familiarize yourself with the existing code and see what you can do.

quoted from sourcehut.org/blog/2022-08-23-

@brainwane omg this sounds like an awesome post. Very excited to read it!

@brainwane i wish it could be federated. So one search to rule them all!

@pbaesse I haven't yet poked at openhub.net/ a lot to check how good its search coverage is....

@brainwane I'm happy to see that Software Heritage is now accepting small cgit instances at archive.softwareheritage.org/a

Much easier than submitting and resubmitting individual repos.

I think there is still a missing discoverability piece here; I'll bet 90% of self-hosted repos have not been archived by them. But I also think you're right to highlight them as a good example.

@joeyh @brainwane Trying to add https://gitweb.gentoo.org/ to it, but with JS on it throws "The provided forge URL is not valid" before the form is sent, and without JS I get an error:
> ImproperlyConfigured: Returned a template response with no `template_name` attribute set on either the view or response

@lanodan @alcinnz also happened to me. I reported it and they said it was broken, but they are deploying a fix tomorrow

@joeyh @brainwane If you have a good idea for adding discoverability functionality to Software Heritage, contact them. They listen, and they care about the Open Source ecosystem.

@brainwane I have to say, I'm highly skeptical of the supposed value of the "searchability" GitHub brings. I find DuckDuckGo or Google to be just as good... I prefer package repos like Debian, or, depending on how well managed they are (i.e. not NPM!), language-specific ones like Hackage. There I know some quality control is present!

Btw I specifically choose to self-host my hobby projects so as not to feed into the silo-mindset which is actively hurting them.

@alcinnz I presume that, when you search, you're looking for projects to use?

I think most of the people and institutions I am speaking to/about in my post are using GitHub search, and data from the GitHub API, for other reasons.

What tools do you use to self-host and do you like them?

@brainwane I'm using CGit with a rudimentary issue tracker. Currently all issues are forwarded to my feedreader for me to copy anything non-spammy to .ISSUES directories in my repos. I found this very easy to set up once I had a homeserver!

@brainwane And yes, my searching is about locating projects to use!

@alcinnz Thanks for the replies! Using a feedreader is especially an interesting idea...

@brainwane is there a way to do a full text search of the Software Heritage archive? I often use github.com/search when looking for projects, but would like to be able to search more widely than GitHub. The search capability Software Heritage provides seems pretty lacking.

@brainwane Okay, it looks like the existing search on their site is limited to URLs and some metadata. Someone could conceivably download their data here: docs.softwareheritage.org/deve and use it to create a more robust search. But as the archive is 11 TB, this would be non-trivial.

@Natris1979 Another option maybe: openhub.net/ ?

cc @zacchiro ^ a question for you/Software Heritage @SWHeritage about searching more facets of the repositories archived there.

@brainwane @Natris1979 @SWHeritage thanks for the highlight!

Full-text search is part of our technical roadmap, but the resources to deploy that at our scale are very significant and we don't have them right now.

The archive size is, in fact, 1 PiB (the 11 TiB mentioned above are just for the graph structure of the archive, not the actual source code files) and a decent fulltext index will be roughly the same size.

People and/or companies interested in helping us out with this are welcome!

@Natris1979 @brainwane just on this one, the archive size is 1 PiB, not 11 TiB (which is just the archive "structure" in terms of commits and source code trees).

See mastodon.xyz/web/@zacchiro/109 for a more detailed answer on full text search plans.

@zacchiro @brainwane holy moly. My brain went from "huh, I bet I could learn to make some kind of full text search work over 11 tb" to "nope that is well beyond my skills"

@Natris1979 @zacchiro I wonder whether you really need fulltext, though, for what you want. Maybe you just need better metadata search to get 80% of the way?

@brainwane @zacchiro sure. For my purposes just like url plus the readme from the main branch would catch 95% of my use cases and I wouldn't mind falling back to GitHub or the less comprehensive/open code search engines when I need to go deeper. Most of the time my question is "is there a tiny open source project that does this one very particular thing before I spend a day making it"? Web search doesn't work well for that, and GitHub search is [see original thread]

@Natris1979 @brainwane Full text indexing of all content is certainly needed down the line. But "only" full text indexing of READMEs is a very interesting idea that I don't think we have explored in the past. Maybe it's something that could just fit in our current indexing infrastructure for metadata. Thanks for raising it!
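
For illustration only, here is a minimal sketch of what "URL plus README" search could look like at toy scale. It is not how Software Heritage indexes anything; the repository URLs, README texts, and the function names `tokenize`, `build_index`, and `search` are all hypothetical, and a real index at archive scale would need very different infrastructure.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(repos):
    """Map each word to the set of repo URLs whose URL or README contains it."""
    index = defaultdict(set)
    for url, readme in repos.items():
        for token in tokenize(url) | tokenize(readme):
            index[token].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = None
    for token in tokens:
        urls = index.get(token, set())
        # Intersect so that all query words must match (AND semantics).
        hits = urls if hits is None else hits & urls
    return sorted(hits)


# Hypothetical sample data: origin URL -> README text from the main branch.
repos = {
    "https://git.example.org/tiny-csv-diff": "Tiny tool to diff two CSV files by key column.",
    "https://code.example.net/feed2issues": "Forward issue reports to a feed reader.",
}
index = build_index(repos)
print(search(index, "csv diff"))
```

The query uses AND semantics on whole words, which is roughly enough for the "does a tiny project for this one thing already exist?" question; ranking, stemming, and phrase matching are left out on purpose.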

@brainwane @Natris1979 on metadata search, for sure. Right now we index "package metadata" from a relatively small subset of package formats out there. We're working to extend coverage to other formats, as well as to crawl metadata from forges (e.g., GitHub descriptions, tags, etc.). Search for "metadata" here: docs.softwareheritage.org/deve Code contributions are welcome on that front (and the tech entry barrier should be fairly low).
