Data Technology Trend #8: Data Next — part 4

LAKSHMI VENKATESH
4 min read · Jun 21, 2021


This article is part of the multi-part series Data Technology Trends (parent article). Previous article: Data Technology Trend #7: Monetized. Previous part of this trend: Data Technology Trend #8: Data Next — part 3; next part of this trend: Data Technology Trend #8: Data Next — part 5.

Trend 8.1 Unified and Enriched Big Data and AI — Delta Lake

A few impressive features of Delta Lake (continued):

5. Delta tables to Delta Live Tables

As we saw earlier, the foundation of the Lakehouse architecture is having Bronze (raw data), Silver (filtered, cleaned, augmented data), and Gold (business-level aggregates). This is the simplest form. In reality, as producers and consumers multiply, and if we do not adopt modern features such as Unity Catalog, we may end up with multiple Bronze, Silver, and Gold buckets. That makes it difficult to maintain a reliable version of the data, and the data lake will soon turn into a data swamp. To preserve a single version of the truth and the reliability of the data, Databricks announced "Delta Live Tables": reliable ETL made easy with Delta Lake.
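To make the layering concrete, here is a minimal PySpark sketch of the three layers as Delta tables. It assumes a Spark session with Delta Lake enabled; the paths, schema, and column names are illustrative assumptions, not from the talk:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw events as-is (hypothetical path and schema).
raw = spark.read.json("/data/raw/events")
raw.write.format("delta").mode("append").save("/lake/bronze/events")

# Silver: filter, clean, and de-duplicate the bronze data.
bronze = spark.read.format("delta").load("/lake/bronze/events")
silver = (bronze
          .filter(F.col("event_id").isNotNull())
          .dropDuplicates(["event_id"]))
silver.write.format("delta").mode("overwrite").save("/lake/silver/events")

# Gold: business-level aggregates for reporting.
gold = silver.groupBy("country").agg(F.count("*").alias("event_count"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/events_by_country")
```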

What is a Delta Live Table:

A Delta Live Table, as the name suggests, serves live data: whenever the underlying data set changes, the change is reflected in the table.

Key Features:

1. Delta Live Tables understand your data pipeline

2. Live tables understand their dependencies

3. It performs automatic monitoring and recovery

4. Enables automatic, environment-independent data management

a. Different copies of data can be isolated and updated using the same code base.

5. Treat your data as code

a. Enables automatic testing

b. A single source of truth covering more than just transformation logic

6. Provides live updates.

How does it work:

Live tables are created directly on top of the underlying file or table. As modifications happen to the underlying data source, they are reflected in the live tables. A Delta Live Table can be shared via Delta Sharing, which ensures that live data is served at the moment the request is made.
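A minimal sketch of what a Delta Live Tables pipeline looks like in Python. The `dlt` module is only available inside a Databricks DLT pipeline, where `spark` is provided as a global; the table names, path, and expectation below are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

# Bronze live table: ingest raw files; DLT registers this as a pipeline node.
@dlt.table(comment="Raw events ingested as-is.")
def events_bronze():
    return spark.read.format("json").load("/data/raw/events")

# Silver live table: DLT infers the dependency on events_bronze and keeps
# this table up to date when the source changes; the expectation drops
# rows that fail the data-quality constraint.
@dlt.table(comment="Cleaned, de-duplicated events.")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")
def events_silver():
    return dlt.read("events_bronze").dropDuplicates(["event_id"])

# Gold live table: business-level aggregate.
@dlt.table(comment="Event counts by country.")
def events_by_country():
    return (dlt.read("events_silver")
            .groupBy("country")
            .agg(F.count("*").alias("event_count")))
```

Because the pipeline is declared as code, DLT can derive the dependency graph (bronze → silver → gold) on its own, which is what features 1 and 2 above refer to.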

What problem does it solve:

Going from query to production is a simple and easy job with Delta Sharing and Unity Catalog. However, sharing data and gathering analytics on terabytes or exabytes of data that must reflect every update to the underlying data (something like materialized views) is not an easy task: it brings many operational challenges and leads to performance bottlenecks.

Source: Databricks Data + AI Summit 2021

Key challenges it solves:

1. Enables data teams to innovate rapidly

2. Ensures useful and accurate analytics and BI with top-notch data quality

3. Ensures a single version of the truth

4. Adapts to organizational growth and the addition of new data.

6. Databricks Machine Learning

MLflow:

Databricks Machine Learning has put together a stack called "Managed MLflow" and has introduced a couple of new features, "AutoML" and "Feature Store". MLflow is an open-source machine learning platform introduced back in 2018.

- AutoML: Automates machine learning so you can get models to deployment quickly. Much of the preprocessing, feature engineering, and training is automated, which saves time, keeps the focus on the quality of the outcome, and frees bandwidth for teams to work on solutions such as Explainable AI. Autologging, tracking, integration with the PyCaret ML library, and new deployment backends (Ray, Algorithmia, and PyTorch, alongside Kubernetes, Docker, Spark, Python, Redis, etc.) are new in MLflow. Databricks has a promising roadmap for Managed MLflow. (A minimal autologging sketch follows this list.)

- Feature Store: Databricks has introduced an online Feature Store. As part of MLflow, if a model is trained against the Feature Store, the model itself will look up its features from the Feature Store at scoring time (see the sketch after this list).
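Autologging in open-source MLflow takes a single call. A minimal sketch using standard MLflow and scikit-learn APIs with a toy dataset:

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Enable autologging: parameters, metrics, and the model artifact are
# recorded without any explicit mlflow.log_* calls.
mlflow.autolog()

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X, y)  # autologging captures params, metrics, and the model
```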
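And a rough sketch of the Feature Store flow with the Databricks Feature Store client. This is an assumption-heavy illustration, not the official example: the feature table `shop.customer_features`, the lookup key, and the DataFrames `label_df` and `new_customers_df` are all hypothetical, and the client API is as I understand it from the 2021 release:

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Declare which features to pull from the store, keyed by customer_id.
lookups = [
    FeatureLookup(
        table_name="shop.customer_features",        # hypothetical table
        feature_names=["total_spend", "visit_count"],
        lookup_key="customer_id",
    )
]

# label_df is a hypothetical DataFrame holding customer_id and the label;
# create_training_set joins it with the looked-up features.
training_set = fs.create_training_set(
    label_df, feature_lookups=lookups, label="churned"
)
df = training_set.load_df().toPandas()
model = RandomForestClassifier().fit(
    df.drop(["customer_id", "churned"], axis=1), df["churned"]
)

# Log the model together with its feature lookups, so scoring can fetch
# the same features from the store automatically.
fs.log_model(
    model, "model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="churn_model",
)

# At inference time, score_batch looks the features up itself:
predictions = fs.score_batch("models:/churn_model/1", new_customers_df)
```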

Source: Databricks Data + AI Summit 2021

MLOps:

MLOps using Databricks' MLflow becomes simple and efficient.

MLOps = DataOps + DevOps + ModelOps (Delta Lake + Git Repos + MLflow).

7. Databricks integration in public cloud

Databricks on Azure: Accelerate data-driven innovation with Azure Databricks.

Databricks on AWS: A simple, unified platform seamlessly integrated with AWS services.

Databricks on GCP: Databricks' open lakehouse platform is fully integrated with GCP's data services.

For other articles refer to luxananda.medium.com.


LAKSHMI VENKATESH

I learn by writing: data, AI, cloud, and technology. All the views expressed here are my own and do not represent the views of the firm I work for.