Now that we're putting all our denormalized output tables and analyses into the DB, we've got a lot more to manage, and are trying to figure out how to best combine existing tools to do it.

GitHub Discussion: github.com/orgs/catalyst-coope

Currently we store column, table, and dataset level information in big JSON-ish data structures, which are converted into objects using @pydantic models based (loosely) on the tabular data package abstractions.
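
As a rough illustration of that conversion step, here is a minimal sketch that turns a JSON-ish table description into typed objects. It uses stdlib dataclasses as a stand-in for the Pydantic models, and the `Field` and `Table` classes, the `ALLOWED_TYPES` set, and the `plants` table are all hypothetical names made up for this example, not PUDL's actual structures.

```python
from dataclasses import dataclass

# Hypothetical metadata; the table and column names are made up for illustration.
RAW_TABLE = {
    "name": "plants",
    "description": "One row per power plant.",
    "primary_key": ["plant_id"],
    "fields": [
        {"name": "plant_id", "type": "integer", "description": "Unique plant ID."},
        {"name": "plant_name", "type": "string", "description": "Name of the plant."},
        {"name": "capacity_mw", "type": "number", "description": "Capacity in MW."},
    ],
}

# Assumed set of logical types, loosely following tabular data package type names.
ALLOWED_TYPES = {"integer", "number", "string", "boolean", "date"}


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    description: str = ""

    def __post_init__(self):
        # Reject unknown logical types as soon as the object is built.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"{self.name}: unknown type {self.type!r}")


@dataclass(frozen=True)
class Table:
    name: str
    fields: tuple
    primary_key: tuple = ()
    description: str = ""

    def __post_init__(self):
        # Every primary key column must be a declared field.
        names = {f.name for f in self.fields}
        missing = [key for key in self.primary_key if key not in names]
        if missing:
            raise ValueError(f"{self.name}: primary key not in fields: {missing}")

    @classmethod
    def from_dict(cls, raw):
        # Build nested Field objects first, then the Table that holds them.
        return cls(
            name=raw["name"],
            fields=tuple(Field(**f) for f in raw["fields"]),
            primary_key=tuple(raw.get("primary_key", ())),
            description=raw.get("description", ""),
        )


table = Table.from_dict(RAW_TABLE)
print([f.name for f in table.fields])
```

With real Pydantic models the validation would be declared on the model rather than written by hand in `__post_init__`, but the shape of the conversion is the same.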

GitHub: Existing tools for managing our metadata? · catalyst-cooperative · Discussion #2546. We have a lot of metadata describing the hundreds of tables and thousands of columns that are part of PUDL, and a somewhat homebrew system for managing it, using a mix of Pydantic and SQLAlchemy. I...

The metadata is used:
* To manage data types in different systems (pandas, SQLite, Parquet/PyArrow)
* To create DB schemas with @zzzeek's SQLAlchemy
* To validate both individual data points and aggregate values
* To export metadata for use by other systems like @Datasette or tabular data packages
* To build human-readable documentation like data dictionaries in Sphinx.
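
Two of the uses above (per-system data types and human-readable docs) can be sketched from the same column metadata. This is only an illustration under stated assumptions: the `FIELDS` list, the `DTYPE_NAMES` mapping, and both helper functions are hypothetical, and the mapping is deliberately incomplete. The pandas names are nullable dtype strings and the PyArrow names match its type factory functions, but nothing here imports either library.

```python
# Hypothetical column metadata; names and descriptions are illustrative.
FIELDS = [
    {"name": "plant_id", "type": "integer", "description": "Unique plant ID."},
    {"name": "plant_name", "type": "string", "description": "Name of the plant."},
    {"name": "capacity_mw", "type": "number", "description": "Capacity in MW."},
]

# Logical type -> dtype name per system. An assumed, non-exhaustive mapping.
DTYPE_NAMES = {
    "integer": {"pandas": "Int64", "pyarrow": "int64"},
    "number": {"pandas": "float64", "pyarrow": "float64"},
    "string": {"pandas": "string", "pyarrow": "string"},
    "boolean": {"pandas": "boolean", "pyarrow": "bool_"},
}


def dtypes_for(fields, system):
    """Return {column name: dtype name} for one target system."""
    return {f["name"]: DTYPE_NAMES[f["type"]][system] for f in fields}


def data_dictionary_rst(fields):
    """Render the fields as a reStructuredText definition list."""
    lines = []
    for f in fields:
        lines.append(f["name"])
        lines.append(f"    *{f['type']}*. {f['description']}")
        lines.append("")
    return "\n".join(lines)


print(dtypes_for(FIELDS, "pandas"))
print(data_dictionary_rst(FIELDS))
```

The point is that neither the dtype mapping nor the docs need their own copy of the column list; both read from the one definition.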

It seems like this must be a common set of tasks, and there's a bunch of different open tools that take on different aspects of it, but we're struggling a bit to glue them all together in an elegant way that doesn't require duplicating data structures or the underlying metadata.

Ideally we'd like to use one data structure to compile the information, and then translate it as needed between different systems using e.g. factory functions.
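
A minimal sketch of the factory-function idea, assuming a plain dict as the single source of truth. The `TABLE` definition, `SQLITE_TYPES`, `to_sqlite_ddl`, and `to_table_schema` are hypothetical names for illustration. It uses the stdlib `sqlite3` module directly rather than SQLAlchemy, and emits a dict shaped like a tabular data package Table Schema (`fields`, `primaryKey`) rather than calling the Frictionless library.

```python
import sqlite3

# Hypothetical single source of truth for one table.
TABLE = {
    "name": "plants",
    "primary_key": ["plant_id"],
    "fields": [
        {"name": "plant_id", "type": "integer", "description": "Unique plant ID."},
        {"name": "capacity_mw", "type": "number", "description": "Capacity in MW."},
    ],
}

# Assumed mapping from logical types to SQLite column types.
SQLITE_TYPES = {"integer": "INTEGER", "number": "REAL", "string": "TEXT"}


def to_sqlite_ddl(table):
    """Factory: build a CREATE TABLE statement from the metadata."""
    columns = []
    for f in table["fields"]:
        columns.append('"{}" {}'.format(f["name"], SQLITE_TYPES[f["type"]]))
    if table.get("primary_key"):
        keys = ", ".join('"{}"'.format(k) for k in table["primary_key"])
        columns.append("PRIMARY KEY ({})".format(keys))
    return 'CREATE TABLE "{}" ({})'.format(table["name"], ", ".join(columns))


def to_table_schema(table):
    """Factory: build a Table Schema style descriptor from the same metadata."""
    return {
        "fields": [dict(f) for f in table["fields"]],
        "primaryKey": list(table.get("primary_key", [])),
    }


# Create the table in an in-memory database and read the columns back.
conn = sqlite3.connect(":memory:")
conn.execute(to_sqlite_ddl(TABLE))
column_names = [row[1] for row in conn.execute('PRAGMA table_info("plants")')]
print(column_names)
print(to_table_schema(TABLE)["primaryKey"])
```

Each new target system then costs one more factory function instead of one more hand-maintained copy of the metadata.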

Some mix of:
* Pydantic models for data validation / type hinting
* Pandera dataframe schemas to get Pydantic to work with pandas and do validation within @dagster
* SQLAlchemy DB schemas
* Frictionless Data tabular data packages
* Maybe we could feed into Hypothesis for creating test data too.
* @tiangolo's SQLModel feels like one useful connector but it doesn't seem to be under active development.