Nobody files a ticket that says “our architecture has an abstraction problem.” They file tickets saying the data is wrong, or missing, or late. So engineering spends two weeks chasing a data-quality issue that doesn’t exist, fixes nothing, and the same ticket comes back the next quarter wearing a slightly different hat.
That was us. The most useful thing I learned from the whole effort is that the bug was never in the data. It was in what we were asking the data to be.
We had an on-premises MongoDB instance serving as the registered golden source for enterprise reference data. Codes, classifications, identification lookups: the unglamorous shared data that quietly underpins customer onboarding, regulatory reporting, and a dozen other things people only notice when they break. It was well-maintained, authoritative, the genuine single source of truth. The team that owned it was rightly proud of it. By every reasonable measure, the system was healthy.
And yet every time an analytics team or a downstream product group needed something from it, the experience was miserable. They reverse-engineered the operational schema. They wrote one-off queries against nested JSON they only half understood. They tracked down whoever still carried the institutional memory of the collection structure, waited, and then repeated the whole ritual three months later when the requirement shifted by an inch.
The diagnosis took longer than it should have
I watched this play out for months before it clicked. The data was fine. We were asking an operational store to moonlight as an analytical platform, and it was bad at the second job. Not through any flaw of its own. It was simply never built for that.
Operational stores optimise for correctness and lifecycle management. Analytics teams need something else entirely: stable shapes, fields that are actually documented, a refresh cadence you can predict, and a way to judge whether a dataset is fit for purpose without reverse-engineering someone else’s schema. These are not the same requirements, and conflating them is precisely how you end up with a system that’s technically sound and practically useless. Healthy uptime, miserable consumers.
So we stopped asking people to consume reference data directly from MongoDB. We started treating each dataset as a data product: something with a named owner, a definition, quality gates, governed access, and a real path to publication. The technical pipeline (MongoDB through Kafka Connect into Landing, Bronze and Silver layers as Iceberg tables on S3, Athena on top, publication through the Data Marketplace) followed from that decision rather than driving it. Twenty-one reference data products eventually shipped down that single path.


Figure 1: The full pipeline. MongoDB as the authoritative golden source, events flowing through Kafka into Landing, Bronze and Silver layers as Iceberg tables on S3, Athena providing the query surface, and the enterprise Data Marketplace as the publication endpoint. Airflow orchestrates everything; DPPS UI provides operational visibility.
What “data product” actually forced us to decide
“Data product” is one of those phrases that can mean almost anything, which usually means it means nothing. So we made it mean something specific and non-negotiable: a dataset could not be published until it had a named owner, a data dictionary, business and technical metadata, documented audit expectations, quality gates, and a governed route into the Marketplace. Compliance with all active standards at deployment time was mandatory, enforced at publication, not requested in a review meeting.
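As a rough illustration of what “enforced at publication” can look like, here is a minimal sketch of a publication gate. Everything in it is an assumption made for illustration: the `DataProduct` class, its field names, and the `publish` function are hypothetical and are not the platform's actual model. The governed route into the Marketplace is not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor; field names are illustrative only."""
    name: str
    owner: str = ""
    data_dictionary: dict = field(default_factory=dict)
    business_metadata: dict = field(default_factory=dict)
    technical_metadata: dict = field(default_factory=dict)
    audit_expectations: str = ""
    quality_gates: list = field(default_factory=list)


# Prerequisites named in the text above, expressed as attribute names.
REQUIRED = (
    "owner",
    "data_dictionary",
    "business_metadata",
    "technical_metadata",
    "audit_expectations",
    "quality_gates",
)


def missing_requirements(product: DataProduct) -> list:
    """Return the required attributes that are empty or unset."""
    return [name for name in REQUIRED if not getattr(product, name)]


def publish(product: DataProduct) -> str:
    """Refuse publication unless every prerequisite is present."""
    missing = missing_requirements(product)
    if missing:
        raise ValueError(
            f"{product.name} cannot be published; missing: {', '.join(missing)}"
        )
    return f"{product.name} published"
```

The point of the sketch is the failure mode: an incomplete product raises at the publication step itself, so the check cannot be skipped by missing a meeting.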
That framing immediately surfaced questions that should have been answered years earlier. What is the exact boundary of this product? Which attributes matter to consumers, and which are operational plumbing nobody outside the owning team cares about? What does “current” mean for this dataset, and how would a consumer know if it had gone stale? How does anyone discover it without filing a ticket and waiting for a human to point them at the right S3 path?
None of that was governance overhead bolted on for show. Answering those questions was the architecture. The Kafka connectors and Iceberg tables were almost the easy part by comparison.
The three decisions that shaped everything else
The first decision was to keep MongoDB as the golden source. No rip-and-replace. Authority stayed where it belonged, with the team that understood the data’s lifecycle and had maintained it correctly for years. The business requirement was explicit: no business-logic transformation, a one-to-one mapping from source to destination, faithful preservation rather than enrichment. The temptation to crown a shiny new system as the source of truth lurks in every modernisation project, and it is almost always wrong. MongoDB did its job well. We were building a delivery layer, not replacing a foundation, and confusing the two is how good migrations turn into eighteen-month disasters.
The second was to build one delivery model instead of tolerating four. Before this work, at least four teams had independently extracted roughly the same reference data, each with its own refresh logic, its own reading of the field semantics, and its own private definition of “current.” The diplomatic word for that situation is “decentralised.” The honest word is chaos. A single path replaced all four private empires: events flowing from MongoDB through Kafka Connect into the pipeline, Airflow orchestrating a monthly batch on the fifth at 07:00 UTC with no dependency on working days or holiday calendars, and schema validation firing before anything touched S3. It was one path anyone could reason about.
The cost of those four pipelines was never the compute or the storage, which was trivial. It was the reconciliation tax. Every time two copies disagreed, and they did, someone senior and busy had to work out which one to believe. Multiply a half-day investigation by every quarter and every consuming team and you arrive at a genuinely expensive habit that never appeared on any budget line, because it was hidden inside everyone’s ordinary work. Collapsing four pipelines into one didn’t just simplify the diagram. It deleted an entire recurring category of argument.
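Two properties of that single path are easy to make concrete: the fixed monthly cadence and the rule that nothing is written before it validates. The sketch below is only an illustration under stated assumptions. `MONTHLY_CRON` is the standard five-field cron expression for 07:00 on day 5 of each month, which a cron-based scheduler such as Airflow can accept, and it only means UTC if the scheduler runs in UTC. `next_run`, `EXPECTED_SCHEMA`, and `validate` are hypothetical helpers, not the pipeline's real code, and the two-field schema is invented.

```python
from datetime import datetime, timezone

# "07:00 on day 5 of every month"; UTC is assumed to be the scheduler's zone.
MONTHLY_CRON = "0 7 5 * *"


def next_run(after: datetime) -> datetime:
    """Next 5th-of-month 07:00 UTC strictly after `after` (must be tz-aware)."""
    after = after.astimezone(timezone.utc)
    candidate = after.replace(day=5, hour=7, minute=0, second=0, microsecond=0)
    if candidate <= after:
        # Roll to the following month; no working-day or holiday adjustment.
        if after.month == 12:
            year, month = after.year + 1, 1
        else:
            year, month = after.year, after.month + 1
        candidate = candidate.replace(year=year, month=month)
    return candidate


# Hypothetical expected shape of one reference record: field name -> type.
EXPECTED_SCHEMA = {"code": str, "description": str}


def validate(record: dict, schema: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of problems; an empty list means the record may be written."""
    problems = [f"missing field: {k}" for k in schema if k not in record]
    problems += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in schema.items()
        if k in record and not isinstance(record[k], t)
    ]
    problems += [f"unexpected field: {k}" for k in record if k not in schema]
    return problems
```

Because the cadence ignores calendars entirely, "when is the next refresh?" has exactly one answer for any given moment, which is a large part of what makes the path predictable for consumers.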
The third was to treat publication as a real pipeline stage rather than an afterthought. Data that reached Silver got published into the Data Marketplace with metadata, a Kitemark quality score, documentation, and subscription behaviour already attached. Consumption happened exclusively through the Marketplace subscription model, never by handing someone an S3 path. Consumers could find a product, judge whether it fit, and subscribe to it without needing to know which bucket to ask about or which Slack channel to beg in. Publication meant the product went live. It did not mean a file quietly appeared in storage and someone hoped the right people would notice.
The boring stuff turned out to be the hard stuff
I kept waiting for the hard problems to show up in the pipeline itself. Kafka connector configuration, Iceberg table maintenance, Athena partition tuning: all of it needed attention, and all of it got sorted in the end. But the gap between “a pipeline that works” and “a platform people trust” came from the things I used to wave off as housekeeping. Naming conventions. Audit column standards. Documentation templates someone would actually open. Ownership that was real rather than nominal.
Naming is a good example of how unglamorous and how decisive this gets. A consumer searching the catalogue has to find a dataset using enterprise-standard terminology, not the internal shorthand that made sense to the team that built it. Mapping the metadata framework to the enterprise standard is tedious work that shows up in no demo. It is also the entire difference between a catalogue people can navigate and a list of cryptic table names only the authors understand.
Here is the uncomfortable part I did not appreciate going in: shared enterprise data tends to fail socially before it fails technically. The Kafka connector will be fine. What corrodes is the shared understanding of what “authoritative” means in practice, whether a given dataset is the real one or a copy somebody made eighteen months ago and forgot to deprecate. No amount of Iceberg optimisation touches that. You fix it at the layer where consumers decide whether to trust a dataset, which is the product layer, and nowhere else.
A concrete example of how social this gets. Early on, two teams disagreed about which currency-code dataset was correct. Both were internally consistent. Both had been “right” at some point. The difference came down to a refresh one team had quietly stopped running a year earlier, and neither team could prove which copy reflected the live source, because nothing in either dataset recorded where it came from or when. We did not fix that with a better connector. We fixed it by making provenance a first-class column. Every Silver record now carries SOURCE_SYSTEM, JOB_RUN_ID, VALID_FROM and VALID_TO, so the question “is this the real one, and is it current?” has a documented answer instead of a hallway debate.
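To show how those four columns turn the hallway debate into a lookup, here is a minimal sketch. The column names come from the text; everything else is assumed for illustration, including the `SilverRecord` class, the `code` payload field, the convention that an empty `VALID_TO` marks an open-ended record, and the half-open validity interval.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SilverRecord:
    code: str                 # illustrative payload field
    SOURCE_SYSTEM: str
    JOB_RUN_ID: str
    VALID_FROM: date
    VALID_TO: Optional[date]  # assumption: None means "still valid"


def is_current(record: SilverRecord, as_of: date) -> bool:
    """True if `as_of` falls in [VALID_FROM, VALID_TO); half-open by assumption."""
    if as_of < record.VALID_FROM:
        return False
    return record.VALID_TO is None or as_of < record.VALID_TO


def provenance(record: SilverRecord) -> str:
    """One-line answer to 'where did this come from, and in which run?'"""
    return f"{record.code}: from {record.SOURCE_SYSTEM}, run {record.JOB_RUN_ID}"
```

With this shape, a stale copy is no longer indistinguishable from a live one: its last `JOB_RUN_ID` and validity window say when it stopped being refreshed.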
Storage is not the product
I have watched teams land data in S3, declare victory on self-service, and then spend six months baffled that nobody is using it. The answer is almost always the same. “The data is in S3” is not a product. It is a location. People need to know the data exists, work out what it means, judge whether it fits their purpose, and find out who to contact when something looks wrong. A path gives them none of that.
The Marketplace addressed this more than any individual pipeline component did. It turned a scattered set of S3 paths into a governed catalogue of subscribable products, each with documentation, a quality score, and clear ownership. That is the difference between handing someone a warehouse address and handing them a shop. And because subscription is the only sanctioned route to the data, the catalogue stays the one front door rather than one option among several private back channels.
Separate truth, transport, and consumption
If I had five minutes with someone starting this work, I would spend all of it on one idea. Separate truth, transport, and consumption, and treat them as three different concerns owned by three different parts of the system. MongoDB holds truth, and stays authoritative. The pipeline, Landing through Bronze to Silver, moves that truth reliably and proves it arrived intact with checksum reconciliation and inter-layer record-count checks. The product layer (Silver tables, Athena, and the Marketplace) makes truth consumable by people who do not know, and should never need to know, how MongoDB organises its collections.


Figure 2: The same data, three separated planes. Truth stays in the operational golden source; transport moves it and proves it arrived intact; consumption exposes it as governed, subscribable products. Separating the three concerns, each with its own owner, is what removes the friction between producers and consumers.
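The transport plane's promise to prove the data arrived intact rests on checksum reconciliation and inter-layer record-count checks. The sketch below shows one way such a check could work between two layers. It is an assumption, not the pipeline's actual implementation: the function names, the SHA-256 hashing scheme, and the sample rows are invented for illustration. It relies on the one-to-one, no-transformation mapping, so in practice the comparison would be limited to the business fields both layers share, excluding any audit columns a later layer adds.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """Stable SHA-256 of one row; keys are sorted so field order is irrelevant."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def layer_checksum(rows) -> tuple:
    """Return (record_count, checksum) where the checksum ignores row order."""
    digests = sorted(row_digest(r) for r in rows)
    combined = hashlib.sha256("".join(digests).encode("ascii")).hexdigest()
    return len(digests), combined


def reconcile(upstream, downstream) -> dict:
    """Compare two layers on record count and on content checksum."""
    up_count, up_sum = layer_checksum(upstream)
    down_count, down_sum = layer_checksum(downstream)
    return {
        "counts_match": up_count == down_count,
        "checksums_match": up_sum == down_sum,
    }
```

The two checks catch different failures: a dropped record shows up in the count, while a silently altered value leaves the count intact and is only visible in the checksum.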
When these three are genuinely separate, an enormous amount of organisational friction simply evaporates. Producers stop getting dragged into ad-hoc reporting. Consumers stop reverse-engineering operational intent. The ops team can evolve the MongoDB schema without shattering six downstream jobs. And a new team that needs country codes or currency classifications can find them in the Marketplace, read the documentation, and be done in a day instead of a quarter.
The data was always fine. What we actually built was the boundary that let everyone stop arguing about it.