Data Technology Trend #8: Data Next — part 5

This article is part of the multi-part series Data Technology Trends (parent article). Previous article: Data Technology Trend #7: Monetized. Previous part of this trend: Data Technology Trend #8: Data Next — part 4. Next part of this trend: Data Technology Trend #8: Data Next — part 6.

Trend 8.2: Streaming:

A streaming or messaging system transfers data from producers to consumers. There are two key types of streaming: (1) point-to-point messaging and (2) publish-subscribe messaging. As with the Modern Data Platform, modern data streaming, for both real-time and non-real-time workloads, is gaining traction.
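The two delivery models differ in who receives each message: in point-to-point, each message is consumed by exactly one consumer; in publish-subscribe, every subscriber receives a copy. A minimal sketch of the distinction (plain Python, no real broker):

```python
from collections import deque

class PointToPointQueue:
    """Point-to-point: each message is delivered to exactly one consumer."""
    def __init__(self):
        self._queue = deque()

    def send(self, message):
        self._queue.append(message)

    def receive(self):
        # Whichever consumer calls receive() first removes the message.
        return self._queue.popleft() if self._queue else None

class PubSubTopic:
    """Publish-subscribe: every subscriber gets a copy of each message."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)

# Point-to-point: the message can be consumed only once.
q = PointToPointQueue()
q.send("order-1")
print(q.receive())  # order-1
print(q.receive())  # None -- already consumed

# Publish-subscribe: both subscribers see the same message.
topic = PubSubTopic()
seen_a, seen_b = [], []
topic.subscribe(seen_a.append)
topic.subscribe(seen_b.append)
topic.publish("order-2")
print(seen_a, seen_b)  # ['order-2'] ['order-2']
```

Kafka follows the publish-subscribe model, generalized with consumer groups.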

Apache Kafka:

What is Apache Kafka:

The Kafka message broker architecture (the general Kafka architecture) has only three key elements: the producer, the broker, and the consumer. The producer publishes (pushes) messages to the broker under a topic, and consumers consume whichever topics they are interested in. Applications, ML / AI / BI tools, and SQL / NoSQL databases can all consume directly from Apache Kafka.

Scalability, elasticity, and flexibility all score well. If not designed carefully, however, the setup can become convoluted and extremely complex, with multiple queues in place, and will end up resembling first-generation architecture rather than modern architecture.

The Kafka Ecosystem and Cluster architecture

Kafka Ecosystem: The Kafka ecosystem consists of the Kafka cluster, producers, consumers, and ZooKeeper.

How does it work:

The producer publishes a message to a broker’s topic partition -> consumers consume the data. ZooKeeper elects the partition leader and coordinates scaling out and in.
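The producer-to-partition step can be sketched as key-based partitioning: Kafka's default partitioner hashes the message key (with murmur2; `crc32` stands in below purely for illustration) and takes it modulo the partition count, so all messages with the same key land on the same partition and keep their order:

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner uses murmur2; crc32 is used here
    # only for illustration -- the modulo step is the same idea.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, which is what
# preserves per-key ordering in Kafka.
assert partition_for("user-42") == partition_for("user-42")

# A toy "broker": one list per partition.
broker = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in [("user-1", "login"), ("user-2", "click"), ("user-1", "logout")]:
    broker[partition_for(key)].append((key, value))

# Both "user-1" events sit on the same partition, in publish order.
p = partition_for("user-1")
print(broker[p])
```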

As part of how Kafka works, I would like to discuss three key points:

- Publishing and Subscribing using Workflows

- Partition, Leadership and Replication

- Event-driven architecture using Quorum controller (KRaft)
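On the second point, each topic partition has one leader replica and zero or more follower replicas: writes go to the leader, followers replicate its log, and on leader failure an in-sync follower is elected as the new leader. A toy sketch of the idea (no real networking; class and method names are illustrative):

```python
class Replica:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

class Partition:
    """One leader, N-1 followers; writes go to the leader and are replicated."""
    def __init__(self, replica_ids):
        self.replicas = [Replica(b) for b in replica_ids]
        self.leader = self.replicas[0]

    def append(self, record):
        self.leader.log.append(record)      # leader takes the write
        for follower in self.replicas[1:]:  # followers fetch and copy it
            follower.log.append(record)

    def elect_new_leader(self):
        # On leader failure, an in-sync follower takes over
        # (ZooKeeper -- or the KRaft quorum -- coordinates this election).
        self.replicas.pop(0)
        self.leader = self.replicas[0]
        return self.leader.broker_id

part = Partition(replica_ids=[101, 102, 103])
part.append("event-1")
part.append("event-2")
print([r.log for r in part.replicas])  # all replicas hold the same log
print(part.elect_new_leader())         # broker 102 takes over as leader
```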

Event-driven architecture using Quorum controller (KRaft):

KRaft is Kafka without ZooKeeper: cluster metadata is managed through an event-driven Raft consensus protocol, with the quorum controller acting as the event-driven consensus layer.

Refer: Confluent source.

Further reading: KIP-595 A Raft protocol for the Metadata Quorum

According to the Kafka project, a quorum controller takes far less time to start up and shut down with millions of partitions, making it more performant. The quorum controller stores its state using an event-sourced storage model.
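The event-sourced storage model means the controller's state is not kept as a mutable snapshot but as an append-only log of metadata events; the current state is rebuilt by replaying that log. A minimal sketch of the idea (the event names below are hypothetical, not Kafka's actual record types):

```python
# Event sourcing in one line: state = fold(apply, event_log)
metadata_log = [
    {"type": "TopicCreated", "topic": "orders", "partitions": 3},
    {"type": "TopicCreated", "topic": "payments", "partitions": 1},
    {"type": "PartitionCountChanged", "topic": "orders", "partitions": 6},
]

def replay(events):
    """Rebuild current cluster metadata by replaying the event log."""
    state = {}
    for event in events:
        if event["type"] in ("TopicCreated", "PartitionCountChanged"):
            state[event["topic"]] = event["partitions"]
    return state

print(replay(metadata_log))  # {'orders': 6, 'payments': 1}
```

Because the log is the source of truth, a restarting controller (or a new quorum member) only needs to replay it to catch up.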

The benefits: simpler operations, simpler deployment, tighter security, support for 2+ million partitions, and single-process execution.

What problem does it solve:

Data streaming is not a new solution. Be it IBM MQ or any publish/subscribe messaging system, data streaming has been around since time immemorial. The new generation of streaming not only transfers messages from point A to point B but is slowly replacing traditional distributed system architecture, which spans multiple reference patterns such as (1) shards, (2) streams, and (3) databases.

Key Features:

Kafka Connect

To connect data sources (producers) or data sinks with Kafka, we use Kafka Connect.

Kafka Connect is a tool for scalably and reliably streaming data between source and target systems and Kafka. Where massive data is involved, Kafka Connect can ingest an entire data set and make it available for processing.
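Connectors are configured declaratively, typically as a small JSON payload submitted to the Connect REST API. A sketch using the FileStreamSource connector that ships with Apache Kafka (the file path and names below are hypothetical):

```json
{
  "name": "file-source-demo",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/var/log/app/events.log",
    "topic": "app-events"
  }
}
```

Posting this to the Connect worker's `/connectors` endpoint starts a task that tails the file and publishes each line to the `app-events` topic; sink connectors are configured the same way in the opposite direction.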

Event Streams:

Data management = Storage + Flow.

Source: Kafka summit 2021

Kafka acts as the central nervous system of the entire enterprise architecture: apps, SaaS applications, databases, data warehouses, etc., all connect to Kafka as producers and consumers.

ksqlDB: Data in Motion + Data at Rest

ksqlDB lets you build modern, real-time applications with the same ease as querying a traditional database. It is an event streaming database for building stream processing applications on top of Apache Kafka.

You can build a streaming app with just a few steps:

1. Capture events

2. Perform continuous transformations

3. Create materialized views

4. Serve lookups against materialized views
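The four steps can be sketched in plain Python: capture events, transform them continuously, maintain a materialized view, and serve lookups against it (ksqlDB expresses the same pipeline in SQL over Kafka topics; the data below is illustrative):

```python
# 1. Capture events (in ksqlDB, a stream backed by a Kafka topic).
events = [
    {"user": "Alice", "page": "/home"},
    {"user": "Bob", "page": "/pricing"},
    {"user": "alice", "page": "/docs"},
]

# 3. The materialized view: a continuously updated aggregate.
pageviews_per_user = {}

def handle(event):
    # 2. Continuous transformation: normalize the key, then update the view.
    user = event["user"].lower()
    pageviews_per_user[user] = pageviews_per_user.get(user, 0) + 1

for event in events:
    handle(event)

# 4. Serve lookups against the materialized view
#    (the equivalent of a "pull query" in ksqlDB).
print(pageviews_per_user["alice"])  # 2
```

The key property, which ksqlDB provides over Kafka, is that the view stays current as new events arrive, rather than being recomputed on demand.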

Disaster Recovery for multi-region and multi-data-center deployments

If one of the data centers (DCs) goes down, Kafka should keep working reliably and auto-heal by failing over to another DC. While setting this up manually can be difficult, fully managed Kafka-as-a-service offerings on AWS / Azure / GCP (such as Amazon MSK) are automatically backed up, self-healing if a DC goes down, highly available, and highly reliable. A managed service lets you migrate and run existing Apache Kafka applications on the public cloud without major changes, and it is fully managed in the sense that all upgrades are taken care of, with no operational overhead. To further process data streams, we can also use fully managed Apache Flink on several public clouds (for example, AWS).

Refer: Source

Managed Kafka Services — Amazon MSK (Managed Streaming for Apache Kafka)

Refer: AWS Source

Kafka Security

Data governance sits at the intersection of security and data discovery, and covers:

- Data Quality

- Data Catalog

- Security

- Data Lineage

- Self Service platform

- Data Policies

Kafka Monitoring

One way to monitor Kafka is to measure specific metrics. All metrics published by Kafka are accessible via the Java Management Extensions (JMX) interface, and managed Kafka services in the public cloud provide built-in views of these metrics. Metrics come from several sources: application metrics, logs, infrastructure metrics, synthetic clients, and client metrics.
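One of the most watched client metrics is consumer lag: the gap between a partition's log-end offset on the broker and the consumer group's last committed offset. A small sketch of the calculation (the offset values below are hypothetical):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = latest broker offset - last committed offset."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

# Hypothetical snapshot for one consumer group on one topic.
log_end = {0: 1500, 1: 980, 2: 2004}
committed = {0: 1500, 1: 950, 2: 2000}

lag = consumer_lag(log_end, committed)
print(lag)                # {0: 0, 1: 30, 2: 4}
print(sum(lag.values()))  # total lag: 34
```

A lag that keeps growing means consumers cannot keep up with producers, which is usually the first alert worth configuring.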

For other articles, refer to the parent article of this series.

Application Development Head | Data Strategy | Big Data | Analytics & BI | Data Governance | Cloud