Now that we're putting all our denormalized output tables and analyses into the #PUDL DB, we've got a lot more #metadata to manage, and are trying to figure out how to best combine existing tools to do it.
GitHub Discussion: https://github.com/orgs/catalyst-cooperative/discussions/2546
Currently we store column, table, and dataset level information in big JSON-ish #python data structures, which are converted into objects using @pydantic models based (loosely) on the #FrictionlessData tabular data package abstractions.
The metadata is used:
* To manage data types in different systems (pandas, SQLite, Parquet/PyArrow)
* To create DB schemas with @zzzeek's #SQLAlchemy
* To validate both individual data points and aggregate values
* To export metadata for use by other systems like @Datasette or tabular data packages
* To build human-readable documentation like data dictionaries in Sphinx
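To make that concrete, here's a rough, dependency-free sketch of the pattern. It uses stdlib dataclasses as a stand-in for our Pydantic models, and every name in it (`FIELD_METADATA`, `TYPE_MAP`, `Field`, the example columns and the type labels chosen for each system) is made up for illustration, not PUDL's actual code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical "JSON-ish" column metadata, keyed by column name.
FIELD_METADATA = {
    "plant_id": {"type": "integer", "description": "Unique plant identifier."},
    "capacity_mw": {
        "type": "number",
        "description": "Nameplate capacity.",
        "unit": "MW",
    },
    "report_date": {"type": "date", "description": "Date reported."},
}

# One logical type, translated into each system's vocabulary.
# These labels are illustrative; SQLite has no native date type, so the
# sketch assumes dates are stored as TEXT.
TYPE_MAP = {
    "integer": {"pandas": "Int64", "sqlite": "INTEGER", "pyarrow": "int64"},
    "number": {"pandas": "float64", "sqlite": "REAL", "pyarrow": "float64"},
    "string": {"pandas": "string", "sqlite": "TEXT", "pyarrow": "string"},
    "date": {"pandas": "datetime64[ns]", "sqlite": "TEXT", "pyarrow": "date32"},
}


@dataclass(frozen=True)
class Field:
    """Stand-in for a Pydantic model describing one column."""

    name: str
    type: str
    description: str
    unit: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject metadata that uses a type we can't translate.
        if self.type not in TYPE_MAP:
            raise ValueError(f"Unknown type {self.type!r} for column {self.name!r}")

    def dtype(self, system: str) -> str:
        """Look up this column's type label for 'pandas', 'sqlite' or 'pyarrow'."""
        return TYPE_MAP[self.type][system]

    def to_doc_line(self) -> str:
        """Render one plain-text data dictionary entry."""
        unit = f" [{self.unit}]" if self.unit else ""
        return f"{self.name} ({self.type}){unit}: {self.description}"


# Convert the raw dictionaries into validated objects.
FIELDS = [Field(name=name, **meta) for name, meta in FIELD_METADATA.items()]

for f in FIELDS:
    print(f.dtype("pandas"), f.dtype("sqlite"), f.dtype("pyarrow"), "|", f.to_doc_line())
```

The point is only that the type information and the docs come from the same place, so nothing has to be written down twice.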
It seems like this must be a common set of tasks, and there are a bunch of open tools that each take on different aspects of it. But we're struggling a bit to glue them all together in an elegant way that doesn't require duplicating data structures or the underlying metadata.
Ideally we'd like to use one data structure to compile the information, and then translate it as needed between different systems using e.g. factory functions.
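As a rough sketch of what we mean by factory functions: one hypothetical `Table` object is the single source of truth, and small functions translate it into whatever each target system needs. The names here (`Column`, `Table`, `to_sqlite_ddl`, `to_frictionless_descriptor`, the `plants` example) are invented for illustration. The sketch uses stdlib dataclasses and `sqlite3` rather than Pydantic and SQLAlchemy so it runs without extra dependencies, and the descriptor only loosely follows the Frictionless Table Schema layout (`fields`, `primaryKey`).

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    type: str  # Frictionless-style type name: integer, number, string, date
    description: str = ""


@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple
    primary_key: tuple = ()


# Assumed mapping from logical types to SQLite column types.
SQLITE_TYPES = {"integer": "INTEGER", "number": "REAL", "string": "TEXT", "date": "TEXT"}


def to_sqlite_ddl(table: Table) -> str:
    """Factory: build a CREATE TABLE statement from the metadata."""
    parts = [f'"{c.name}" {SQLITE_TYPES[c.type]}' for c in table.columns]
    if table.primary_key:
        pk = ", ".join(f'"{k}"' for k in table.primary_key)
        parts.append(f"PRIMARY KEY ({pk})")
    return f'CREATE TABLE "{table.name}" ({", ".join(parts)})'


def to_frictionless_descriptor(table: Table) -> dict:
    """Factory: build a tabular-data-package-style resource descriptor."""
    return {
        "name": table.name,
        "schema": {
            "fields": [
                {"name": c.name, "type": c.type, "description": c.description}
                for c in table.columns
            ],
            "primaryKey": list(table.primary_key),
        },
    }


plants = Table(
    name="plants",
    columns=(
        Column("plant_id", "integer", "Unique plant identifier."),
        Column("capacity_mw", "number", "Nameplate capacity in MW."),
    ),
    primary_key=("plant_id",),
)

# The same object drives both the database schema and the exported metadata.
conn = sqlite3.connect(":memory:")
conn.execute(to_sqlite_ddl(plants))
sqlite_columns = [row[1] for row in conn.execute('PRAGMA table_info("plants")')]
descriptor = to_frictionless_descriptor(plants)
```

Adding another target (a Pandera schema, a Sphinx data dictionary page) would then mean adding one more function, not another copy of the metadata.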
Some mix of:
* Pydantic models for data validation / type hinting
* Pandera dataframe schemas to get Pydantic to work with pandas and do validation within @dagster
* SQLAlchemy DB schemas
* Frictionless Data tabular data packages
* Maybe we could feed into Hypothesis for creating test data too.
* @tiangolo's SQLModel feels like one useful connector but it doesn't seem to be under active development.
ANYway, if you're facing similar issues or have any experience to lend, we've created a discussion on GitHub and would love to hear from other folks:
https://github.com/orgs/catalyst-cooperative/discussions/2546
@ZaneSelvans maybe @LudwigHuelk can chime in with his experience on the #OpenEnergyPlatform?
https://openenergy-platform.org