can anyone help me trace the historical context for the development of "data lakes"? I am seeing them used to accumulate data in a way that isn't all that useful to the people who interact with it, but is useful for the hosts to crawl and surveil -- how far off am I?
@jonny my understanding of a data lake is that it centralises the data that a particular enterprise has, to enable them to more effectively examine relations across it. the people interacting with it would generally be internal to the enterprise, so it's not clear to me what the distinction between them and "hosts" is here.
The cynical me sees this through the same lens through which I now see G+ Circles... This is big tech fishing for specific information, but labeling it as a service. In this case, it is looking for process information. The format of raw data contains implicit information about the inputs, which may eventually be mined with machine learning: that's the bait for corporations. The switch is that, if enough companies (2 or 3) move their processing of the raw data onto clouds, then the host can reverse engineer those processes, generalize them, and offer them to the industry as a whole. Eventually the cloud operator will acquire a direct competitor in that industry and/or go direct to that industry's customers with that knowledge.
@jonny I'm just starting to encounter these and it feels like a digital episode of Hoarders. Oh sure, you might use that someday... You won't, though. Here, pay nominal storage fees forever, while we put the old stuff on the slow disk, because you were never really going to look.
@jonny I am not persuaded that anyone derives value from data at rest. If there was too much to process as a stream, letting it pile up won't improve matters. And getting anything back out of cloud storage tends to be a lot more expensive than putting it in was. Too many companies think they're sitting on a treasure trove just because it's "data," while they have no plan for what to do with it, don't know which parts of it even matter, have an unlimited retention policy, and somehow miss that what they're sitting on is a trash heap.
@jonny Google Trends is good for finding post-2000 phrase timelines.
For earlier terms and history, use the Google Ngram Viewer.
Google Trends on "data lakes" suggests a ~2010 emergence, with stronger growth in 2015 and again since 2017.
From my own experience with data analytics, a heck of a lot of this is driven by solutions providers trying to market their products. Hence the succession of terms: data processing, information intelligence, knowledge workers, online analytical processing (OLAP), data warehousing, "data marts" (1990s), "data bricks", etc.
@jonny the book "Permanent Record" by E. Snowden. I can't recall a specific chapter where he discusses data lakes specifically, but the general sense of what he says matches what you wrote. The main purpose of the book is to explain why he leaked data. However, there are for sure references in it that could help you go deeper.