This article is part of the multi-part series Data Technology Trends (parent article). Previous article: Data Technology Trend #7: Monetized. Next part of this trend: https://luxananda.medium.com/data-technology-trend-8-data-next-part-2-f22eff2ad5a2.
Trend 8.1 Unified and Enriched Big Data and AI — Delta Lake
There have been several massive shifts in Data Technologies. Especially in recent times, the data world has been upended, and the importance of data management, governance, and strategy has taken a front seat.
Note: Unlike other trends in this series, for Delta Lake and Kafka Streaming I am taking a bit of a deep dive, as I think they are the future of distributed file processing and data management. This article is "too long, do read!" :) I will try to make it worth your time.
The key shifts in Data trends:
1. ETL -> ELT
2. Big Data Map Reduce -> Apache Spark
3. On-Premise -> Modern Cloud Data Platforms
4. Hadoop & Data Warehousing -> Data Lakes
5. Siloed databases -> Converged and unified OLTP, OLAP, and Analytics platforms
6. Data Streaming from an alternative -> an essential approach, and more…
I genuinely think Delta Lake will be adopted by more and more organizations, and that it is the future, which is why I wanted to highlight it as part of the final trend, "Data Next"!
What is Delta Lake:
Delta Lake is an open-source storage layer that enables you to build a "Lakehouse" architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.
We have seen it all, from small data to Data Warehouses to Data Lakes. For application servers in general, with various cloud offerings, the trend has been to separate the compute layer from the storage layer so that each can scale independently. In the same space, database servers, which had a tight coupling of storage and compute (processing), have been de-coupled: EMR, Data Lakes, Snowflake, etc., are good examples of this.

Delta Lake is also file-based (this could be your existing Data Lake, yes): you have a query layer (or processing with Apache Spark) on top of a massive file system, with the ability to provide ACID consistency (which means you can practically insert, update, and delete data, something that is not a practical possibility on many of the modern cloud data platforms that are file-based), versioned data, and inherent scalability to petabytes or exabytes. How cool is that! ACID consistency is ensured by the transaction log (like the redo logs in Oracle or the WAL of PostgreSQL).
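To make the transaction-log idea concrete, here is a minimal, pure-Python sketch of how a Delta-style log can turn a pile of immutable files into a versioned, time-travelable table. The class and method names are hypothetical toys, not Delta Lake's actual API; real Delta Lake writes JSON "add"/"remove" actions into a `_delta_log` directory alongside Parquet data files, which this sketch only loosely imitates:

```python
import json
import os

# Toy model of a Delta-style transaction log (illustrative only: real Delta
# Lake stores JSON "add"/"remove" actions in a _delta_log directory next to
# Parquet data files; the names here are hypothetical).
class TinyDeltaLog:
    def __init__(self, table_path):
        self.log_dir = os.path.join(table_path, "_delta_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def commit(self, actions):
        """Write one commit file per table version (version = commit count)."""
        version = len(os.listdir(self.log_dir))
        commit_file = os.path.join(self.log_dir, f"{version:020d}.json")
        with open(commit_file, "w") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return version

    def snapshot(self, as_of=None):
        """Replay commits (optionally only up to `as_of`) to get the live file set."""
        live_files = set()
        for version, name in enumerate(sorted(os.listdir(self.log_dir))):
            if as_of is not None and version > as_of:
                break
            with open(os.path.join(self.log_dir, name)) as f:
                for line in f:
                    action = json.loads(line)
                    if "add" in action:
                        live_files.add(action["add"])
                    elif "remove" in action:
                        live_files.discard(action["remove"])
        return live_files
```

Replaying the log up to version N yields the table exactly as of that commit, which is how consistent reads and time travel fall naturally out of the log design.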
The problem Delta Lake is trying to solve
Apache Spark brought Big Data back into the limelight and solved the problem of building massive, performant data sets for data pipelining and processing.
Data Lakes solved the problem of unifying data into a single repository, enabling a data platform on which to build data applications such as data warehousing, efficient and performant data queries, and analytics. The single version of truth shifted from paper to reality.
With effective technologies such as Apache Spark and Data Lakes already in place, already having upended the data world, and doing pretty well, what is the need to adopt yet another new technology? Because challenges still exist in the Data Lake space, from data reliability and security to performance, in their own words.
Spark and Data Lakes still require a rich toolset and substantial building effort. Delta Lake puts a layer on top of these two amazing technologies and provides the required tools for organizations that aim to achieve unified data solutioning. So, Delta Lake is not a replacement for Apache Spark and/or the Data Lake; rather, it is an additional layer enabling organizations to build performant and unified data platforms. Spark does not have ACID consistency on its own; Delta Lake brings ACID consistency on top of Spark, similar to transactional databases, while handling massive volumes of data and unifying streaming and batch processing. Delta Lake is built by the creators of Spark.
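As a rough sketch of why ACID updates are possible on top of immutable files: data files are never modified in place; an update rewrites the affected file and commits a new table version in one atomic step, so readers always see a complete snapshot, never a half-applied change. The class below is a hypothetical, in-memory toy (not Delta Lake's actual API), shown only to illustrate copy-on-write commits and time travel:

```python
# Toy copy-on-write table (illustrative only; names are hypothetical, not
# Delta Lake's API). Each version maps data-file name -> rows.
class CopyOnWriteTable:
    def __init__(self, files):
        self.versions = [dict(files)]  # version 0

    def snapshot(self, version=None):
        """Read a consistent snapshot: the latest, or time travel to `version`."""
        return self.versions[-1 if version is None else version]

    def update(self, filename, fn):
        """ACID-style update: rewrite one immutable file, commit a new version."""
        current = self.versions[-1]
        new_version = dict(current)  # shallow copy: untouched files are reused
        new_version[filename] = [fn(row) for row in current[filename]]
        self.versions.append(new_version)  # the atomic "commit"
        return len(self.versions) - 1
```

Because untouched files are shared between versions, a commit costs only the rewritten file plus a log entry, which is part of what lets this scheme scale to very large tables.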
Data Lake Vs. Delta Lakehouse Vs. Data Warehouse:
Why Data Warehouses fail:
The main purpose of building Data Warehouses is to generate analytics from a single place for business decision-making and to improve business performance. However, many Data Warehousing applications fail for two key reasons:
(1) Upstream-to-downstream data curation and normalization: Data Warehouses store curated data from upstream systems, as Data Warehouses are immutable. Will all the upstream systems be willing to do this curation, and will they have the bandwidth for it even if it is a management mandate? Not really. There will be data duplications across the organization and process overlaps built up over several years. Identifying the redundant processes and building a unified system in which upstream sends all relevant information is not a cakewalk (who said walking in cake is an easy task.. anyway). Building a Data Warehouse has many pre-conditions that need to be satisfied, and these are often overlooked or not addressed before the Data Warehouse project gets started. So, the Data Warehouse comes back to the drawing board more often than not. You may have seen 150+ Data Warehouses exist in a single firm. If the Data Warehouse is the single version of the truth and is "central" to the firm, then why are there so many "central" Data Warehouses? To implement a Data Warehousing solution in a firm, the business faces a constant trade-off between urgency and importance. Data Warehousing is not a low-hanging fruit, and its benefits are realized only over time. Any new feature-rich project will always be given importance by the business over Data Warehousing.
(2) The factor of time: Building a Data Warehouse takes time. While answering the essential questions of why Data Warehousing? and what approach are we taking to build it? (Inmon vs. Kimball, or a combination based on organizational needs, etc.), a clear message should be delivered that building Data Warehouses takes substantial time and effort, and that the benefits can be seen only in the long run. Many firms set this deadline at 1–2 years, mark the Data Warehouse as a failure, and move on to building a new one within a few years. Data Warehousing is a function of resources, time, quality, cost, technology, and, most importantly, data. Reading, understanding, curating, and normalizing thousands and thousands of data points across several systems into a single warehouse is not an easy task!
The problems that exist in Data Warehousing do not vanish just by introducing a Data Lake or Delta Lake. However, unifying data in a single place, without the need to curate thousands of data points first, becomes quicker and easier. Also, unlike Data Warehousing, the unification of data sees the light of day sooner, because all the data, in its original form, lands in a single place without any fancy modifications (with the ability to process structured and unstructured data, and without the need to stick to proprietary file types). Building a single version of the truth from this massive, unified data set, by talking to one business at a time and bringing in only the essential fields, becomes easier. Building data services for quick business use on top of this massive staging data becomes a lot easier too.