Data Technology Trend #8: Data Next — part 2
This article is part of the multi-part series Data Technology Trends (parent article). Previous article — Link: Data Technology Trend #7: Monetized. Previous section of this article — Data Technology Trend #8: Data Next — part 1, and next part of this article — Data Technology Trend #8: Data Next — part 3.
Trend 8.1 Unified and Enriched Big Data and AI — Delta Lake
A few impressive features of Delta Lake:
The Data and AI Summit 2021 announced several key features. Listed here are a few of the most impressive ones, together with those introduced recently. Beyond the other core capabilities such as metadata handling, time travel, and support for CRUD operations, the following are the most notable.
1. The Lakehouse architecture -> Delta.io
A Lakehouse architecture consolidates and integrates the entire data flow, from sourcing through consumption. Delta Lake is an open-source project with which you can build a Lakehouse architecture. It unifies Data Warehousing, Data Lakes, and Analytics, and it can be built with the tools supplied by any of the modern cloud providers such as AWS or Azure, with Databricks' own tooling, or with other modern technologies.
The Lakehouse architecture as a concept is not new. Ever since we started ingesting and processing data, the ultimate aim of any datastore has been to gain meaningful, actionable insights from the data and to use it for business growth and performance. While this has traditionally happened in silos across the organization in a multi-tier mode, the introduction of Delta Lake puts a structure around it and makes data sourcing, storage, and sharing easy. Based on the paper by Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia, the diagram below depicts the evolution of Delta Lake, aka the Lakehouse architecture.
Even though many organizations may start with a two-tier architecture, the good news is that shifting to a Lakehouse architecture, although complicated by the massive and disparate data points involved, is not overly cumbersome or complex. The Lakehouse architecture is essentially a complete bundle of data management systems covering everything from sourcing to consumption, and it therefore simplifies the overall data management task. Is Delta Lake, the open-source project, the first of its kind? Not really; however, Delta Lake is more popular due to its innate benefits such as ACID transactions, metadata management, and simple storage and sharing. Refer to this article for a comparison between Hudi, Iceberg, and Delta Lake.
Technology stack for Delta Lake:
We can also directly use Databricks' interface and rich toolset to build sophisticated platforms. Most popular public clouds provide tight integration with Databricks solutions.
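As a minimal sketch of what the open-source end of this stack looks like (assuming the delta-spark package rather than a Databricks workspace, and an illustrative local path), a Spark session can be configured with the Delta Lake extensions like this:

```python
# Configure a local SparkSession with the Delta Lake extensions (delta-spark package).
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("delta-lake-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip attaches the Delta Lake JARs to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame in the Delta format; the path is purely illustrative.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/demo_table")
```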
2. ACID Transactions
The Lakehouse architecture enables you to do everything from sharing to consuming data and, most importantly, as part of the consumption layer, lets users run BI, SQL analytics, machine learning, and other analytics workloads. All of this concerns data warehousing and analytics (the OLAP, BI, and analytics layer). But what about transactional (OLTP) databases? Only if Delta Lake provides the power of OLTP + OLAP + BI and analytics can we call it a unified modern data platform. With transactional OLTP databases, the challenge is mutability: the ability to append and modify data, apply the archive log to the correct change reference number, and support real-time operations with consistency, while at the same time delivering performance, scalability, and elasticity, keeping cost in check, and maintaining the best possible data quality. That sounds impossible, doesn't it?
How about having ACID transactions on top of Spark? Spark addresses the problems of scale, performance, data pipelines, and so on, and ACID solves the problem of consistency. That is the promise of Databricks. Along with ACID consistency, the Lakehouse architecture also supports partitioning, indexing, schema validation, and handling of large metadata.
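To make the mutability point concrete, here is a hedged sketch of an atomic upsert on a Delta table using the DeltaTable MERGE API (it assumes the Delta-enabled spark session from the earlier sketch; the table path and column names are made up for illustration):

```python
# Atomic upsert into a Delta table: readers see either the old snapshot or the
# new one, never a partially applied write.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/customers")  # illustrative path
updates = spark.createDataFrame(
    [(1, "alice@new.example"), (42, "new.user@example.com")],
    ["customer_id", "email"],
)

(
    target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdate(set={"email": "u.email"})   # modify existing rows
    .whenNotMatchedInsertAll()                     # append new rows
    .execute()                                     # runs as one ACID transaction
)
```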
The single most important transition that came with the advent of the Data Lake is that organizations created a massive staging area and began operating out of the data lake. This essential shift added one more layer on top of it: the Lakehouse architecture, which enables an open format and common storage of data at varying quality and processing stages (Bronze, Silver, and Gold) and provides a clean way to build BI, streaming analytics, data science, or ML solutions on top of it. For the next level of processing, take the data from whichever stage is suitable, be it Bronze, Silver, or Gold. The thought process has shifted from re-using the streaming pipeline to build applications to using the Bronze, Silver, and Gold data sources directly.
Even though this topic is about the Lakehouse architecture, we cannot completely ignore the streaming architecture and talk only about the Lakehouse. For designing a distributed file architecture, there is more than one way to go about it:
Method 1: Use of Streaming / Brokers:
Discussed later in the section.
Method 2: Use of Lakehouse architecture:
Yet another way to efficiently pipeline and store data is the Lakehouse architecture. Similar to Apache Kafka streaming, the Lakehouse architecture can also be divided into three layers: producer, store, and consumer. The storage itself is split into three tiers: Bronze (raw data), Silver (filtered/cleaned/augmented data), and Gold (processed data). Based on the business need, data can be retrieved from any of these buckets and used and re-used, as sketched below.
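The following is a rough sketch of the Bronze to Silver to Gold hops with Delta tables; the paths, schema, and filter/aggregation logic are illustrative assumptions, not a prescribed pipeline:

```python
# Bronze -> Silver -> Gold medallion hops on Delta tables (illustrative).
from pyspark.sql import functions as F

# Bronze: land the raw events as-is.
raw = spark.read.json("/landing/events/")
raw.write.format("delta").mode("append").save("/lake/bronze/events")

# Silver: deduplicate and clean the raw data.
bronze = spark.read.format("delta").load("/lake/bronze/events")
silver = bronze.dropDuplicates(["event_id"]).filter(F.col("event_type").isNotNull())
silver.write.format("delta").mode("overwrite").save("/lake/silver/events")

# Gold: business-level aggregates ready for BI, streaming analytics, or ML.
gold = silver.groupBy("event_type").agg(F.count("*").alias("event_count"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/event_counts")
```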
Scalability, elasticity, and flexibility all score well here. For future growth and the inclusion of new business units, there is no need to add an entirely new data lake solution the way we would add a queue; it is simply a matter of adding a few buckets in the existing data lake space and provisioning them to the business unit. The art is in not letting the data lake become unmanageable and turn into a data swamp, which would be disastrous.
If we pull out the one core issue that turns a Data Lake into a Data Swamp, it is segmentation and the need to grant access rights to different business units based on the underlying data set, in other words, Data Governance. If this is addressed, the data lake can be used optimally and with high data quality. Sharing can be simplified using the "Unity Catalog"; this is a good concept that can be adopted regardless of whether we use Databricks or not. It is simple and efficient.
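As a hedged illustration of what such governance looks like in practice, Unity Catalog-style grants can be issued through Spark SQL on a Unity Catalog-enabled workspace; the catalog, schema, table, and group names below are made up for the example:

```python
# Grant a business-unit group read access to one table (illustrative names).
spark.sql("GRANT USE CATALOG ON CATALOG main TO `marketing_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `marketing_analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `marketing_analysts`")
```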
For other articles refer to luxananda.medium.com.