I did not realize you can post up to 100 GB of data to #Kaggle, and that they provide access to computational resources and #Jupyter notebooks.
We're thinking about automatically posting all our #PUDL data there, and maybe running community competitions to help solve entity matching, anomaly detection, and imputation problems. Is there any downside to doing this?
#OpenData #MachineLearning #DataScience #EnergyTransition #EnergyTwitter #EnergyMastodon
https://www.kaggle.com/datasets/zaneselvans/catalyst-cooperative-pudl
If nothing else it's a way to publicize the data, and provide potential users an easy way to interactively explore it!
@ZaneSelvans sounds awesome!
And, if it's open data, I don't see any downside.
They now have machines with 16 GB of RAM for free, and their datasets are the first place I search for open data about any subject.
@hugoboia cool! I think we can pretty easily set up our nightly builds to update the dataset periodically. Do people participate in community competitions at all?
@ZaneSelvans I'm not really sure.
Seems like most of the competitors prefer the ones with prizes...
Maybe sharing an invite here, so people can repost and spread the word, is a good idea.
You already have at least one participant.
@hugoboia yeah, I think we would need to try to drive folks who care about the underlying climate application to the competition. But it could still be a useful structure.
@ZaneSelvans also, the Kaggle Courses are really straightforward and teach everything from Python basics to geospatial analysis and ML, so they can be useful even for people interested in climate but not used to programming languages.