Social.coop @SocialCoop

**Zane Selvans** @ZaneSelvans · Apr 26, 2023

Now that we're putting all our denormalized output tables and analyses into the #PUDL DB, we've got a lot more #metadata to manage, and are trying to figure out how to best combine existing tools to do it.

GitHub Discussion: https://github.com/orgs/catalyst-cooperative/discussions/2546

Currently we store column, table, and dataset level information in big JSON-ish #python data structures, which are converted into objects using @pydantic models based (loosely) on the #FrictionlessData tabular data package abstractions.
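As a rough illustration of that pattern, here is a minimal sketch that turns dict-based metadata into validated objects. It uses stdlib dataclasses as a stand-in for Pydantic models, and every name in it (`FIELD_METADATA`, `RESOURCE_METADATA`, `Field`, `Resource`, `build_resource`, the `plants` table and its columns) is made up for the example rather than taken from PUDL. The type names follow the Frictionless Table Schema vocabulary.

```python
from dataclasses import dataclass

# Hypothetical column- and table-level metadata kept as plain dicts.
FIELD_METADATA = {
    "plant_id": {"type": "integer", "description": "Unique plant identifier."},
    "capacity_mw": {"type": "number", "description": "Nameplate capacity in MW."},
}
RESOURCE_METADATA = {
    "plants": {
        "description": "One row per plant.",
        "fields": ["plant_id", "capacity_mw"],
        "primary_key": ["plant_id"],
    },
}

# A small subset of Frictionless Table Schema field types.
ALLOWED_TYPES = {"integer", "number", "string", "boolean", "date"}


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    description: str = ""

    def __post_init__(self):
        # Reject types outside the shared vocabulary at construction time.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"{self.name}: unknown type {self.type!r}")


@dataclass(frozen=True)
class Resource:
    name: str
    description: str
    fields: tuple
    primary_key: tuple = ()

    def __post_init__(self):
        # Every primary key column must be one of the table's fields.
        names = {f.name for f in self.fields}
        missing = [key for key in self.primary_key if key not in names]
        if missing:
            raise ValueError(f"{self.name}: primary key not in fields: {missing}")


def build_resource(name: str) -> Resource:
    """Look up a table by name and assemble it from the shared field metadata."""
    meta = RESOURCE_METADATA[name]
    fields = tuple(Field(name=f, **FIELD_METADATA[f]) for f in meta["fields"])
    return Resource(
        name=name,
        description=meta["description"],
        fields=fields,
        primary_key=tuple(meta.get("primary_key", ())),
    )


plants = build_resource("plants")
```

Defining each column once in `FIELD_METADATA` and referencing it by name from the tables is one way to avoid restating a column's type and description in every table that uses it.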

GitHub · Existing tools for managing our metadata? · catalyst-cooperative · Discussion #2546: We have a lot of metadata describing the hundreds of tables and thousands of columns that are part of PUDL, and a somewhat homebrew system for managing it, using a mix of Pydantic and SQLAlchemy. I...

#datadon

**Zane Selvans** @ZaneSelvans · Apr 26, 2023

The metadata is used:
* To manage data types in different systems (pandas, SQLite, Parquet/PyArrow)
* To create DB schemas with @zzzeek's #SQLAlchemy
* To validate both individual data points and aggregate values
* To export metadata for use by other systems like @Datasette or tabular data packages
* To build human-readable documentation like data dictionaries in Sphinx
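The first two uses in the list above can be sketched with a single lookup table from a canonical type to its name in each system. This is only an illustration under stated assumptions: the `TYPE_MAP` contents, the `FIELDS` list, and the helper names are hypothetical, the pandas and PyArrow entries are dtype names held as strings so nothing beyond the standard library is needed, and the stdlib `sqlite3` module with hand-built DDL stands in for a SQLAlchemy schema.

```python
import sqlite3

# Canonical type -> name of the matching type in each target system.
# (Illustrative mapping, not PUDL's actual one.)
TYPE_MAP = {
    "integer": {"pandas": "Int64", "sqlite": "INTEGER", "pyarrow": "int64"},
    "number": {"pandas": "float64", "sqlite": "REAL", "pyarrow": "float64"},
    "string": {"pandas": "string", "sqlite": "TEXT", "pyarrow": "string"},
    "boolean": {"pandas": "boolean", "sqlite": "INTEGER", "pyarrow": "bool"},
}

# Hypothetical columns as (name, canonical type, description).
FIELDS = [
    ("plant_id", "integer", "Unique plant identifier."),
    ("capacity_mw", "number", "Nameplate capacity in MW."),
]


def pandas_dtypes(fields):
    """Column -> dtype mapping in the shape that DataFrame.astype() accepts."""
    return {name: TYPE_MAP[ftype]["pandas"] for name, ftype, _ in fields}


def create_table_sql(table, fields, primary_key):
    """Build a SQLite CREATE TABLE statement from the same field list."""
    cols = [f'"{name}" {TYPE_MAP[ftype]["sqlite"]}' for name, ftype, _ in fields]
    pk = ", ".join(f'"{key}"' for key in primary_key)
    cols.append(f"PRIMARY KEY ({pk})")
    return f'CREATE TABLE "{table}" ({", ".join(cols)})'


dtypes = pandas_dtypes(FIELDS)

conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql("plants", FIELDS, ["plant_id"]))

# Read the schema back to confirm the declared column types.
columns = [(row[1], row[2]) for row in conn.execute('PRAGMA table_info("plants")')]
conn.close()
```

Because both the dtype mapping and the DDL are derived from `FIELDS`, a type change made in one place propagates to both systems.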

**Zane Selvans** @ZaneSelvans · Apr 26, 2023

It seems like this must be a common set of tasks, and there are a bunch of different open tools that each take on different aspects of it, but we're struggling a bit to glue them all together in an elegant way that doesn't require duplicating data structures or the underlying metadata.

Ideally we'd like to use one data structure to compile the information, and then translate it as needed between different systems using e.g. factory functions.
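One possible shape for that idea is sketched below: a single `Table` object compiled once, plus small factory functions that each translate it into one target format. The class and function names, the `plants` example, and the choice of targets (a Frictionless-style resource descriptor and a reStructuredText definition list for a data dictionary) are all assumptions made for illustration, with dataclasses again standing in for Pydantic models.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    description: str


@dataclass(frozen=True)
class Table:
    name: str
    fields: tuple
    primary_key: tuple


def to_resource_descriptor(table: Table) -> dict:
    """Factory: Table -> dict shaped like a Frictionless tabular data resource."""
    return {
        "name": table.name,
        "schema": {
            "fields": [
                {"name": f.name, "type": f.type, "description": f.description}
                for f in table.fields
            ],
            "primaryKey": list(table.primary_key),
        },
    }


def to_rst(table: Table) -> str:
    """Factory: Table -> reStructuredText section with a definition list."""
    lines = [table.name, "-" * len(table.name), ""]
    for f in table.fields:
        lines.append(f"``{f.name}`` ({f.type})")
        lines.append(f"    {f.description}")
    return "\n".join(lines)


# Hypothetical table used only for this example.
plants = Table(
    name="plants",
    fields=(
        Field("plant_id", "integer", "Unique plant identifier."),
        Field("capacity_mw", "number", "Nameplate capacity in MW."),
    ),
    primary_key=("plant_id",),
)

descriptor = to_resource_descriptor(plants)
descriptor_json = json.dumps(descriptor, indent=2)
data_dictionary = to_rst(plants)
```

Keeping the translators as free functions rather than methods means a new target system can be added without touching the core structure, at the cost of having to keep each function in sync when the structure changes.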

**Zane Selvans** @ZaneSelvans · Apr 26, 2023

Some mix of:
* Pydantic models for data validation / type hinting
* Pandera dataframe schemas to get Pydantic to work with pandas and do validation within @dagster
* SQLAlchemy DB schemas
* Frictionless Data tabular data packages
* Maybe we could feed into Hypothesis for creating test data too.
* @tiangolo's SQLModel feels like one useful connector but it doesn't seem to be under active development.