There is a particular kind of dread that comes with inheriting a data lake from a previous engineering team: the storage is enormous, and the documentation is thin, or missing entirely, or written in shorthand that only made sense to people who no longer work there.
Companies that built these lakes between 2018 and 2022 often did so in a hurry, pushing data in from every connected system with a loose intention of organizing it later.
The firms now offering data lake consulting to enterprise clients have found that “later” has a weight to it, compressed into unmapped tables and orphaned pipelines that nobody on the current team can fully explain. What those firms are doing, increasingly, looks less like engineering and more like archeology.
Arriving at these environments reveals a particular texture of accumulated neglect. Firms now providing consulting services around legacy data architecture are routinely finding actual strata inside these lakes: schemas modified half a dozen times with no changelog, folders of raw CSVs that nobody ingested, Kafka topics still receiving events from a service deprecated in 2021.
Data quality is consistently cited as the top operational concern among enterprise data leaders, with more than 60% of organizations reporting that a material share of their lake assets cannot be used reliably without remediation work first.
The Weight of What Nobody Mapped
A company runs on its data lake. Reports get built, dashboards go live, and the engineers who designed the original ingestion pipelines move on, taking institutional knowledge with them.
Their replacements inherit something nobody documented well, with business rules embedded in transformation logic that predates the current stack by half a decade. Join conditions that made sense once, when the product had one version, are hardcoded into a transformation that nobody has touched in three years.
The data is real, and the value buried inside that logic is also real. But without lineage or any reliable sense of what the tables mean relative to each other, the lake sits there like a sealed archive that everyone quietly routes around.
Technical debt in data infrastructure has started receiving the same sober attention that application-layer debt has gotten for years. Large organizations typically spend 10 to 20% of their technology budgets managing the consequences of past architectural decisions rather than developing new features.
For data-heavy organizations, the share typically runs higher. The legacy lake is not a sideshow. It sits at the center of the problem, and the engineers who inherit it are the ones who feel that most acutely.
What makes the current moment different from previous cleanup attempts is the arrival of large language models as practical tools for schema inference and semantic classification. Not as a silver bullet. As a trowel.
When the Algorithm Does the Digging
The consultants doing this work now have something their predecessors did not: models that can read a table called usr_trxn_evt_raw_v3 and infer, from its column names, value distributions, and relationships to neighboring tables, what it probably contains and why it was probably created.
These models crawl hundreds of tables across a lake, group them by likely business domain, flag the ones with obvious quality problems, and surface the ones that appear to carry dormant value.
Not with certainty. With enough signal to prioritize. The speed of that initial surface scan is what changes the economics of the recovery work.
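To make the inference step concrete, here is a minimal sketch of semantic classification over table and column names. It uses simple keyword scoring as a stand-in for the LLM call; the domain keyword sets and column names are illustrative assumptions, not anything a real consultancy ships.

```python
# Keyword-scoring stand-in for LLM-based semantic classification.
# Domain vocabularies and all table/column names are hypothetical.

DOMAIN_KEYWORDS = {
    "customer": {"usr", "user", "cust", "customer", "email"},
    "finance": {"txn", "trxn", "invoice", "amount", "payment"},
    "product": {"sku", "product", "item", "catalog"},
}

def classify_table(table_name: str, columns: list[str]) -> str:
    """Guess a table's probable business domain from name and column tokens."""
    tokens: set[str] = set()
    for name in [table_name, *columns]:
        tokens.update(name.lower().split("_"))
    scores = {
        domain: len(tokens & keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify_table("usr_trxn_evt_raw_v3", ["usr_id", "trxn_amount", "evt_ts"]))
```

An LLM replaces the keyword table with learned context, which is what lets it also weigh value distributions and neighboring tables, but the shape of the output is the same: a probable domain label with enough signal to prioritize, not a certainty.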
Firms like N-iX have built practices around exactly this kind of recovery, combining automated classification with human-led domain validation to produce a working catalog from what was previously noise.
The process typically moves through several stages, though the order shifts depending on what the environment contains:
- Crawling and sampling table contents to identify data types and infer schemas where none are documented
- Running semantic grouping to cluster tables by probable business function (finance, customer, product, operations)
- Flagging data quality anomalies, including missing values, impossible timestamps, duplicate keys, and format inconsistencies
- Mapping lineage where it can be reconstructed from query logs, ETL code, or ingestion metadata
- Delivering a prioritized catalog that separates live, reliable assets from deprecated or low-quality ones
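The anomaly-flagging stage above is the most mechanical of the five, and a simplified version fits in a few lines. This is a hedged sketch only: the checks, field names, and the year-2000 cutoff for "impossible" timestamps are assumptions chosen for illustration.

```python
# Simplified row-level quality checks: missing values, duplicate keys,
# and implausible timestamps. Thresholds and field names are assumptions.
from datetime import datetime, timezone

def flag_anomalies(rows: list[dict], key: str, ts_field: str) -> list[str]:
    """Return human-readable flags for common data quality problems."""
    flags: list[str] = []
    seen_keys: set = set()
    now = datetime.now(timezone.utc)
    for i, row in enumerate(rows):
        if any(v is None or v == "" for v in row.values()):
            flags.append(f"row {i}: missing value")
        k = row.get(key)
        if k in seen_keys:
            flags.append(f"row {i}: duplicate key {k!r}")
        seen_keys.add(k)
        ts = row.get(ts_field)
        # Flag future timestamps and anything before an assumed epoch floor.
        if isinstance(ts, datetime) and (ts > now or ts.year < 2000):
            flags.append(f"row {i}: impossible timestamp {ts.isoformat()}")
    return flags
```

In practice, checks like these run as a first pass over sampled rows; the interesting work is deciding which flags indicate a dead table and which indicate a live one with a fixable ingestion bug.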
The list does not capture how iterative the process actually is. A table that looks abandoned may have an active downstream consumer.
A seemingly reliable dataset may carry a silent quality failure that only surfaces when tested against a known business result.
Human judgment, guided by the model’s output, is what turns a raw crawl into a catalog that teams will actually trust and use.
Getting the catalog wrong means people stop consulting it, and a catalog nobody consults is worse than no catalog at all.
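Lineage reconstruction from query logs, one of the stages listed above, can be sketched in a similar spirit. This version assumes the log is a list of plain INSERT ... SELECT statements; a regex pass like this only catches the simple cases, and real ETL code needs a proper SQL parser.

```python
# Naive lineage extraction from INSERT...SELECT statements in a query log.
# Assumes one source table per statement; real SQL needs a real parser.
import re

EDGE_RE = re.compile(
    r"insert\s+into\s+(\S+).*?\bfrom\s+(\S+)",
    re.IGNORECASE | re.DOTALL,
)

def lineage_edges(queries: list[str]) -> set[tuple[str, str]]:
    """Extract (source, target) table pairs from INSERT...SELECT text."""
    edges: set[tuple[str, str]] = set()
    for q in queries:
        m = EDGE_RE.search(q)
        if m:
            target, source = m.group(1), m.group(2)
            edges.add((source, target))
    return edges
```

Even a crude edge set like this is enough to answer the question that matters for the catalog: whether a table that looks abandoned still has a downstream consumer writing from it.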
Databricks notes a sharp increase in enterprises applying LLM-based tooling to data cataloging and governance tasks, concentrated among organizations managing lakes older than three years.
The Case for Patience
Somewhere in most legacy lakes, there is value the business has already forgotten about. A dataset predating the current CRM migration, carrying customer behavior patterns that no current system captures.
A product usage log from a feature discontinued years ago, whose behavioral signals still apply to something the company sells today. Often, the people who built these datasets left years before anyone thought to document what they were for.
Data lake consulting, at its best, is not a cleanup service. It is a recovery operation, and the consultants who do this work well bring engineering precision alongside something closer to forensic patience, the willingness to sit with an unfamiliar data structure long enough to understand what it was trying to say.
Enterprises that treat the legacy lake as a sunk cost to be written off are often making the same mistake twice: they added data without understanding it, and now they are discarding it for exactly the same reason.
Final Word
The legacy lake problem is not going away, and the enterprises that built these environments are not going to rebuild them from scratch.
The practical answer is to work with what already exists, which requires both the methods and the mindset to treat old data as something worth recovering rather than routing around.
For companies carrying four-plus years of unmapped assets, the question is not whether that data has value. The question is who has the tools and the patience to find it.

