Data Technology Trend #8: Data Next — part 3

Jun 21, 2021

This article is part of the multi-part series Data Technology Trends (parent article). Previous article: Data Technology Trend #7: Monetized. Previous part of this trend article: Data Technology Trend #8: Data Next — part 2. Next part of this trend article: Data Technology Trend #8: Data Next — part 4.

Trend 8.1 Unified and Enriched Big Data and AI — Delta Lake

A few impressive features of Delta Lake (continued):

3. Unity Catalog

Databricks Unity Catalog is a brand-new feature introduced in 2021 (at the time of writing, there is a waitlist to sign up for it). Data lakes turn into data swamps mainly because governance is missing: data keeps growing, more and more buckets get created, and end users are given file-based access rights. If this problem can be solved, data swamps can be avoided and data lakes can be maintained with their data quality intact. To achieve this, grant permissions at the query level instead of the file level! Any organization has three types of users: (1) simple users, (2) power users, and (3) super users. If access for each of these users is restricted at the query/table level, the data lake itself need not be disturbed or altered.

Data Lake starts all neat…

Unity Catalog, as per Databricks, is the world’s first unified catalog for the lakehouse.

Key features

1. One interface to govern all data assets

2. One security model based on “ANSI SQL”

3. Full integration with existing catalogs

A data lake stores everything as files, and different users need different access rights to different files. To keep them apart, we tend to create multiple buckets and grant permissions at the file level. Databricks saw this as the point where a data lake starts to get murky, and introduced permissions at the ANSI SQL level instead of the file level. If a user has permission to run a query on a table, they have access to that data set. This is not a new concept; we have been dealing with fine-grained access permissions in databases for several decades. But with the shift to the modern data platform and new technology, the same fine-grained access permissions need to be carried over and re-aligned for performance, ease of use, and keeping the data lake from getting murky. It addresses four main issues:

1. What if users are interested in only a few columns of a file/table?

2. What if there is a change in the data layout?

3. What if the data lake and the tables go out of sync? Which data should be used, and what about different governance models for different data technologies?

4. What if there is a change in the organization's governance rules?

Solution:

The time-tested solution is to provide:

- Fine-grained permissions on tables, fields, views, and NOT files

- Industry-standard — ANSI SQL grants

- Unified permission model for all data assets

- Centrally audited

Source: Databricks Data + AI Summit 2021
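To make the idea of table- and column-level grants concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Unity Catalog API: the `ToyCatalog` class, its method names, the `sales` table, and the user names are all hypothetical and exist only for illustration. It shows how a permission attached to a table and its columns can answer "may this user run this query?" without touching the underlying files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog; NOT the real Unity Catalog API."""

    # table name -> set of column names
    tables: dict = field(default_factory=dict)
    # (principal, table) -> set of columns the principal may SELECT
    grants: dict = field(default_factory=dict)

    def create_table(self, table, columns):
        self.tables[table] = set(columns)

    def grant_select(self, principal, table, columns=None):
        """Grant SELECT on the whole table (columns=None) or on some columns."""
        allowed = self.tables[table] if columns is None else set(columns)
        unknown = allowed - self.tables[table]
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.grants.setdefault((principal, table), set()).update(allowed)

    def can_select(self, principal, table, columns):
        """True only if every requested column has been granted."""
        granted = self.grants.get((principal, table), set())
        return set(columns) <= granted


catalog = ToyCatalog()
catalog.create_table("sales", ["order_id", "region", "amount", "customer_email"])

# A simple user sees only two columns; a super user sees the whole table.
catalog.grant_select("simple_user", "sales", ["order_id", "region"])
catalog.grant_select("super_user", "sales")

print(catalog.can_select("simple_user", "sales", ["region"]))          # True
print(catalog.can_select("simple_user", "sales", ["customer_email"]))  # False
print(catalog.can_select("super_user", "sales", ["customer_email"]))   # True
```

Because the grant is recorded against the table and its columns, the files behind `sales` can be moved or re-partitioned without anyone having to redo the permissions, which is the point behind issue 2 above.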

4. Delta Sharing

As an organization produces more and more data, spreads across regions, and opens its information to external parties, data needs to flow across borders, both within the organization and outside it. Each country's regulations and governance frameworks differ, which means data often cannot simply be shared within a single cloud provider or framework. The data has to cross borders and public clouds and still remain safe and secure. A robust “sharing” framework is needed to support this complexity. To solve this, Databricks has introduced an open protocol for secure data sharing, billed as the industry's first, called “Delta Sharing”.

What is Delta Sharing:

Sharing across borders, within or outside the organization, is increasingly complicated by multi-country regulations and the different governance structures adopted in the cloud. Delta Sharing has the following goals (as per Databricks):

1. Share existing, live data in data lakes/lakehouses (no need to copy it out)

2. Support a wide range of clients by using existing, open data formats

3. Strong security, auditing, and governance

4. Efficiently scale to massive datasets

Key features:

- Fully open, without proprietary vendor lock-in. Vendor-neutral OSS governance model.

- Not restricted to SQL, full support for Data Science.

- Easily managed privacy, security, and compliance.

- A vibrant ecosystem that integrates across all clouds.

How does it work:

Delta Sharing is another efficient and simple design, and it can be used even if an organization is not using Databricks.

Say a user from an internal or external organization requests a data set. The data provider checks the related access permissions and, if they pass, returns a “temporary short-lived URL” to the actual data, ring-fenced by the required security protocols that have been provisioned for S3.
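The flow described above (check permission, then hand out a short-lived URL) can be sketched in a few lines of Python. This is only an illustration of the idea, not the Delta Sharing protocol or the S3 pre-signing scheme: the HMAC signature, the `SECRET_KEY`, the `PERMISSIONS` table, the `storage.example.com` host, and the function names are all assumptions made up for the example.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"              # hypothetical provider-side secret
PERMISSIONS = {("partner_a", "sales")}   # hypothetical (recipient, table) grants


def _sign(path, expires):
    """Sign the file path together with its expiry time."""
    message = f"{path}|{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_url(recipient, table, ttl_seconds=300, now=None):
    """Provider side: check access, then return a URL that expires soon."""
    if (recipient, table) not in PERMISSIONS:
        raise PermissionError(f"{recipient} may not read {table}")
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    path = f"/data/{table}/part-0000.parquet"  # illustrative file name
    query = urlencode({"expires": expires, "sig": _sign(path, expires)})
    return f"https://storage.example.com{path}?{query}"


def is_url_valid(url, now=None):
    """Storage side: accept the URL only if it is untampered and unexpired."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _sign(parsed.path, expires)
    return hmac.compare_digest(signature, expected) and now < expires


url = issue_url("partner_a", "sales", ttl_seconds=60, now=1000)
print(is_url_valid(url, now=1030))  # True: still inside the 60-second window
print(is_url_valid(url, now=1061))  # False: the URL has expired
```

The recipient never receives long-lived credentials for the storage account. They get a link that works for one file for a short time, so the provider keeps control even after the data set has been requested.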

What problem it solves:

It does not distinguish by region, by individual versus cross-firm sharing, or by quantity of data: if the user has permission to access the underlying data, it simply sends a short-lived URL for the recipient to act on. This helps in multiple ways beyond sharing data. Individual organizations or regions need not build separate data warehouses if they use the data as is. If they process the received data set to generate ML or analytics output, it is simply an extension of the data lake, and they, in turn, can use the same Delta Sharing principles and server to share the final data with their downstream consumers. Is it good, or is it good!

- As a concept, it can be adopted with or without Databricks.

- The provider can share a “single version of the truth” of the underlying table/file/partition

- Live sharing of the table, with ACID consistency!

- Any client that can read Parquet can support Delta shares

- Better, faster, cheaper, smarter, and more reliable; parallelism is possible using any modern cloud file system

We shall discuss how to share the table live under “Delta Live Table” later in this section.

Delta Sharing Ecosystem: