Should we convert an existing Data Warehouse to a Data Lake?
I often get this question.
I think this decision should be driven by a rational evaluation of business needs, not by the fact that “all cool dudes are doing it”.
The data warehouse (DW) has been around for a couple of decades. Historically, it has been the gold copy of operational data in structured format & the main BI enabler. Many DWs are built on legacy on-premises platforms, with layers of interconnected data marts that have evolved over the years into complex “bus” constructs. Also, a business that has used a DW for decades understands its data very well for analytic queries.
In my mind, converting a construct of such long standing is anything but simple.
On the other hand, the data lake (DL) still lacks clarity in the minds of many people. While the fluidity, flexibility, capabilities, & cost-efficiency of DLs are unquestionably superior to those of traditional DWs, getting “business buy-in” could be challenging to start with, because the sheer magnitude of functionality that a DL brings can be overwhelming to the business.
So what basic questions should we ask to understand the business’s true need: a DW, a DL, or a hybrid of DW & DL?
In my mind, the following questions are fundamental, though the list is not limited to them:
1. Is there a need to ingest semi-structured & unstructured data that flows in at high velocity, in wide variety, and in huge volumes from across the enterprise, and to store it in the same location as structured operational data in a storage-optimized way?
2. Do we have a need for predictive analytics leveraging enterprise data?
3. Do we need real-time or near-real-time data ingestion for streaming analytics, operational reporting, etc.?
4. Do we need polyglot persistence of enterprise data for purpose-driven consumption?
5. Is there a need to develop a data lake from scratch to deliver the above capabilities, or to augment the capabilities of an existing data warehouse? It is normally cheaper to build a data lake on the cloud than to rearchitect an existing data warehouse into one.
6. Do we need to improve a low ratio of value-per-byte to cost-per-compute in the current data warehouse?
Please feel free to share your thoughts and experiences.