Data Technology Trend #2: Strategic (Part 1)
Simplification of Data
Make simple things simple, and complex things possible — Alan Kay
Trend #2.1: The Data Warehouse makes a big comeback on the Data Lake
With complex and ever-growing data sets, thousands of columns have to be curated and normalized to build a Data Warehouse. That is the very reason why Data Warehouses fail.
What does Data Lake do:
Ideally, a Data Lake stores all the data from the upstream systems, in all data types, without throwing anything away. Users can access this initial store of raw data on a need basis, depending on the type of analysis they want to perform. Unlike a Data Warehouse, which has an impeccable structure and where adding any new data source takes time to curate and normalize, the staging-like Data Lake does not take much time to integrate. Once the data is available, performing calculations on the fly and bringing out actionable insights from the available data points is not complicated. The cycle to realize the value of the data can begin within a few weeks instead of the months or years typical of a traditional Data Warehouse. Beyond storing this data in the Data Lake, Data Marts can be built directly out of it using federated queries, RDS, or Redshift / Snowflake, depending on the budget and requirements of the firm.
How does it work:
Structured, semi-structured, and unstructured data can be stored as is in the Data Lake. After processing, this data can be used directly to build reports, BI, analytics, etc. It can also be stored in a database for further use.
Example: Redshift & Data Lake (Glue and S3) in AWS. Cloud is a given for Data & Analytics
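The store-as-is step can be sketched in a few lines of Python. This is only an illustration: the local filesystem stands in for object storage such as S3, and the function name land_raw, the raw/<source>/<date>/ layout, and the sample files are assumptions made for this sketch, not part of any AWS API.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes, day: date) -> Path:
    """Write the payload byte for byte under raw/<source>/<yyyy-mm-dd>/<filename>."""
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no schema, nothing thrown away
    return target


# A temporary directory plays the role of the lake (hypothetical stand-in for a bucket).
lake = Path(tempfile.mkdtemp())
day = date(2024, 1, 15)
samples = {
    "orders.csv": b"order_id,amount\n1,10.50\n2,7.25\n",  # structured
    "clicks.json": b'{"user": "u1", "page": "/home"}\n',   # semi-structured
    "scan.bin": bytes([0x89, 0x50, 0x4E, 0x47]),           # unstructured
}
landed = {name: land_raw(lake, "sales", name, data, day) for name, data in samples.items()}
```

All three files land unchanged, whatever their type; any schema is applied later, when the data is read for a specific analysis.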
Best practices in Lake Formation to avoid creating a Swamp:
In order for the Data Lake to be successful, it is important to construct a “healthy Lake architecture and governance” in the first place. Being agile and offering open data lake access does not mean that all the data can simply be dumped into one place. There has to be a clear architecture and structure, with defined data pipelines and flows, to enable a healthy Lake Formation. Many organizations switch to a new model every now and then because their current Data Lake does not work or is already turning into a Data Swamp, mainly because governance is missing and every business unit builds a small Data Lake of its own. Building a healthy Data Lake calls for a central, organization-level design, even if it must include multi-cloud and multi-data-lake design. This keeps the data lake open, so on-demand queries or analytics can be run at any time on any subset of the data or on the whole. If the Data Lake is the epicenter of an organization’s architecture, building a healthy one is imperative to build a scalable, sustainable Data Platform / Product out of it.
Data from the different lines of business within the organization is segregated into one place in its original form or natural state. Data flows in as files or streams from the different business units and is captured as is. The required set of users can be given permission to access and analyze the data based on need. All the data is loaded from the source systems and no data is thrown away. This data, as stored in the Data Lake, is termed “Bronze”: raw/original data.
The data is then filtered, cleaned, and augmented based on need and moved to different functional buckets. Data is transformed, and the required schema can be applied at this phase to make the data in the files queryable or to load it further into modern cloud databases. This data, as stored in the Data Lake, is termed “Silver”: filtered, cleaned, and augmented data.
Finally, business summaries and explanations are generated from the processed data to enable smoother decision making for the business. This data, as stored in the Data Lake, is termed “Gold”: business summary.
If the data can be categorized into Bronze, Silver, and Gold, building a Delta Lake on top of it in the future becomes easier.
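The Bronze, Silver, and Gold stages can be illustrated with a minimal in-memory sketch. The sample extract, the column names, and the function names to_silver and to_gold are hypothetical; a real pipeline would read from and write to lake storage rather than Python strings and lists.

```python
import csv
import io
from collections import defaultdict

# Bronze: a hypothetical raw extract, kept exactly as the source system sent it.
BRONZE_CSV = """order_id,region,amount
1,EMEA,10.50
2,emea ,7.25
3,APAC,not_a_number
4,APAC,20.00
"""


def to_silver(raw_text):
    """Silver: apply a schema, normalize values, and filter out unparseable rows."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # the bad row is skipped here but still exists in Bronze
        rows.append({
            "order_id": int(rec["order_id"]),
            "region": rec["region"].strip().upper(),
            "amount": amount,
        })
    return rows


def to_gold(silver_rows):
    """Gold: a business summary, here the total order amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(BRONZE_CSV)
gold = to_gold(silver)
```

Because Bronze is never modified, the Silver and Gold layers can be rebuilt at any time if the cleaning rules or the business summary change.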
When multiple Data Lakes across different cloud providers are involved due to regulatory restrictions, try to keep a similar structure and similar naming conventions across all of the business’s Lake formations, so that there is synergy and future expansion is simpler.
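One way to keep structure and naming aligned across lakes is to generate every storage path from a single shared helper. The sketch below is an assumed convention, not a standard: the <layer>/<business_unit>/<dataset>/dt=<date>/ layout, the function name lake_prefix, and the bucket names in ROOTS are all hypothetical.

```python
import re
from datetime import date

LAYERS = ("bronze", "silver", "gold")
_NAME = re.compile(r"^[a-z][a-z0-9_]*$")  # lowercase snake_case only


def lake_prefix(layer, business_unit, dataset, day):
    """Build the same relative prefix regardless of which cloud hosts the lake."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    for part in (business_unit, dataset):
        if not _NAME.match(part):
            raise ValueError(f"name must be lowercase snake_case: {part!r}")
    return f"{layer}/{business_unit}/{dataset}/dt={day.isoformat()}/"


# Hypothetical lake roots, one per cloud provider and region.
ROOTS = {"aws": "s3://example-lake-eu", "gcp": "gs://example-lake-us"}

prefix = lake_prefix("silver", "retail", "orders", date(2024, 1, 15))
locations = {cloud: f"{root}/{prefix}" for cloud, root in ROOTS.items()}
```

Only the root differs per provider; everything below it is identical, so pipelines and access rules written for one lake carry over to the others with little change.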
What problem it tries to solve:
The main purpose of building a Data Warehouse is to be able to generate all analytics from a single place. However, many Data Warehousing initiatives fail for two reasons:
(1) Upstream to downstream data curation: Data Warehouses store curated data from the upstream systems, as Data Warehouses are immutable. Will all the upstream systems be doing this curation? Not really. There will be data duplication across the organization and process overlaps that have built up over several years. Identifying the redundant processes and building a unified system in which the upstream sends all relevant information is not a cakewalk (who said walking in cake is an easy task in the first place!). Building a Data Warehouse has many prerequisites that need to be satisfied, and these are often overlooked or not addressed before the Data Warehouse project starts. So the Data Warehouse goes back to the drawing board more often than not. You would have seen 150+ Data Warehouses in a single firm. If the Data Warehouse is the single version of the truth and is “Central” to the firm, why are there so many Central Data Warehouses?
When implementing a Data Warehousing solution in a firm, the business always faces a trade-off between urgency and importance. Data Warehousing is not low-hanging fruit, and its benefits are realized over time. Any new feature-rich project will always be given priority over Data Warehousing.
(2) Factor of Time: Building a Data Warehouse takes time. While answering the essential questions of why Data Warehousing and which approach to take (Inmon vs. Kimball, or a combination based on the organization’s needs, etc.), a clear message should be delivered that building Data Warehouses takes substantial time and effort and that the benefits can be seen only in the long run. Many firms set this deadline at 1–2 years, mark the Data Warehouse as a failure, and move on to building a new one within a few years. Data Warehousing is a function of resources, time, quality, cost, technology, and, most importantly, data. Reading, understanding, curating, and normalizing thousands and thousands of data points across several systems into a single warehouse is not an easy task!
The problems that exist in Data Warehousing do not vanish just by introducing a Data Lake or Delta Lake. However, unifying data into a single place, without the need to curate thousands of data points, becomes quicker and easier. Also, unlike Data Warehousing, this unification sees the light of day sooner, as all the data is in a single place in its original form without any fancy modifications (with the ability to process structured and unstructured data and without the need to stick to proprietary file types). Building a single version of the truth from this massive, unified data set, by talking to one business at a time and bringing in only the essential fields, becomes easier. Building Data Services for quick business use on top of this massive staging data also becomes a lot easier.
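Bringing in only the essential fields, one business at a time, can be pictured as a small per-business mapping applied to the raw records. The sketch below is illustrative only: the business units, the source field names, and the canonical names customer_id and amount are hypothetical.

```python
# Agreed with one business at a time: source field -> canonical field.
ESSENTIAL_FIELDS = {
    "retail": {"cust_no": "customer_id", "ord_total": "amount"},
    "wholesale": {"customerRef": "customer_id", "invoiceValue": "amount"},
}


def curate(business_unit, raw_record):
    """Keep only the agreed essential fields and rename them to canonical names."""
    mapping = ESSENTIAL_FIELDS[business_unit]
    curated = {canonical: raw_record[source] for source, canonical in mapping.items()}
    curated["business_unit"] = business_unit
    return curated


# Raw records as they sit in the lake, with extra fields nobody has asked for yet.
raw = [
    ("retail", {"cust_no": "C-1", "ord_total": 10.5, "till_id": 7, "promo": "X"}),
    ("wholesale", {"customerRef": "C-9", "invoiceValue": 200.0, "terms": "NET30"}),
]
unified = [curate(bu, rec) for bu, rec in raw]
```

The untouched fields stay in the lake, so onboarding another business unit or another field later means extending the mapping rather than re-ingesting the source.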
- Data Lake for modern batch data warehouses
- Data Lake as the base for Delta Lake
- Multiple Data Lakes across several public clouds due to Regulatory restrictions
For other articles refer to luxananda.medium.com.