Towards Data & Cloud #4: Data Ingestion Wars (Part 2)

Framework, Tools & Technology Focus

LAKSHMI VENKATESH
6 min read · Mar 18, 2024

Continued from Part 1.

Data Streaming Architecture

Data streaming has changed a lot, especially in the last ten years. Let’s break it down simply.

What Changed with Data Streaming?

Before, data streaming was mainly about sending messages from one system to another. Now, it’s much more. Streaming can handle, process, store, and even analyze data in real time. Think of it like this: in the past, streaming was just passing a note from one person to another. Nowadays, it’s like having a super-smart robot that can read the note, understand it, store it, and even tell you interesting things about it right away.

Comparison with the Tokyo flood storage system

During a tea-time conversation in 2019, a friend brought up an interesting analogy: “Tokyo Flood Storage System vs today’s Data Streaming architecture”. This analogy has stuck with me ever since.

When the Tunnel is filled with water (Temporary Storage)

Tokyo’s flood storage system, particularly the Metropolitan Area Outer Underground Discharge Channel, shows how well-designed architecture and advanced engineering can tackle natural disasters. Initiated in 1992 and completed in 2006, this massive project, often referred to as the “world’s largest drain,” connects rivers and waterways to a network of tunnels and giant cisterns deep underground, designed to prevent floodwaters from overwhelming the city. This underground marvel consists of 6.4km of tunnels up to 50m deep, connecting five giant silos to a massive tank known as the “Underground Temple”. When the flood storage system is empty (that is, its temporary storage job is done), you can take a tour of the Underground Temple.

When the Tunnel is empty, you can take a tour!

In contrast, data streaming architectures and technologies such as Kinesis, Kafka or Flink address the challenge of managing and processing the continuous influx of data generated by source systems or IoT devices. With short-term retention, the streaming system is free again once all the messages have been read off; with long-term retention, the data stays in the streaming infrastructure. While Tokyo’s flood system deals with physical water flow, data streaming handles the flow of information, allowing for real-time data processing, analysis, and storage. Both systems, though in completely different realms, embody the principle of controlling and utilizing flows, whether of water or data, to benefit society and prevent potential disasters.
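To make the short-term versus long-term retention idea concrete, here is a minimal sketch in Python. It assumes a time-based retention window, which is one common way to configure this. The RetentionBuffer class and its method names are hypothetical and are not taken from Kinesis, Kafka, or Flink.

```python
from collections import deque


class RetentionBuffer:
    """Toy in-memory stream buffer with time-based retention (hypothetical)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = deque()  # (timestamp, payload), oldest on the left

    def append(self, timestamp, payload):
        self._records.append((timestamp, payload))

    def expire(self, now):
        """Drop records older than the retention window; return how many were dropped."""
        dropped = 0
        while self._records and now - self._records[0][0] > self.retention_seconds:
            self._records.popleft()
            dropped += 1
        return dropped

    def __len__(self):
        return len(self._records)


# Short-term retention: the "tunnel" drains after 60 seconds.
short_term = RetentionBuffer(retention_seconds=60)
# Long-term retention: the same data stays around for 7 days.
long_term = RetentionBuffer(retention_seconds=7 * 24 * 3600)

for buf in (short_term, long_term):
    buf.append(0, "sensor reading A")
    buf.append(30, "sensor reading B")
    buf.expire(now=70)

print(len(short_term), len(long_term))
```

With the short window, only the newer record survives at time 70, while the long window still holds both. That is the "empty tunnel" versus "full tunnel" picture in miniature.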

Why Kappa and Lambda Matter and Why We Care

  • Kappa is about doing everything in real-time, treating all data the same, whether it’s coming in now or it’s older.
  • Lambda splits things up, handling real-time data one way and taking care of stored, or “batch,” data differently.
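The difference between the two styles can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: events are (user, amount) pairs held in memory, and the function names are made up for this example. A real Lambda or Kappa deployment would involve separate systems, not two functions.

```python
from collections import Counter


def batch_totals(stored_events):
    """Batch path: recompute totals from the full stored dataset."""
    totals = Counter()
    for user, amount in stored_events:
        totals[user] += amount
    return totals


def stream_totals(events, state=None):
    """Streaming path: fold events into running state one at a time."""
    state = Counter() if state is None else state
    for user, amount in events:
        state[user] += amount
    return state


def lambda_view(stored_events, live_events):
    """Lambda: merge a batch view with a separate real-time view."""
    merged = Counter(batch_totals(stored_events))
    merged.update(stream_totals(live_events))  # update() adds counts together
    return merged


def kappa_view(event_log):
    """Kappa: one streaming path; older data is just an earlier part of the log."""
    return stream_totals(event_log)


stored = [("ana", 10), ("raj", 5)]
live = [("ana", 3)]

# Both styles should arrive at the same answer for the same events.
assert lambda_view(stored, live) == kappa_view(stored + live)
```

The trade-off this hints at: Lambda keeps two code paths that must agree, while Kappa keeps one path and relies on being able to replay the log.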

As businesses grow, the amount of data they deal with explodes. They need to scale up, meaning their systems have to get bigger and smarter to handle more data. Plus, they need to make sure the data is consistent and accurate, which is super important for things like payments, where you can’t afford mistakes.

The Cool New Stuff: Analytics and Learning

The latest streaming systems don’t just move data around. They can analyze it on the fly and even learn from it, helping businesses make smart decisions in real-time. It’s like having a team of analysts and learners working non-stop, digging into your data as it comes in.
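As a small illustration of analyzing data on the fly, the sketch below keeps a running mean and standard deviation (using Welford's online algorithm) and flags values that look unusual compared with what it has seen so far. The class name, function name, and threshold are assumptions for this example. Real streaming analytics and machine learning features are far richer than this.

```python
import math


class RunningStats:
    """Running mean and standard deviation, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared differences from the current mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def is_outlier(self, value, threshold=3.0):
        """True if value is more than `threshold` standard deviations from the mean."""
        if self.count < 2 or self.stddev == 0.0:
            return False
        return abs(value - self.mean) > threshold * self.stddev


def flag_outliers(values, threshold=3.0):
    """Check each value against the stats seen so far, then learn from it."""
    stats = RunningStats()
    flagged = []
    for value in values:
        if stats.is_outlier(value, threshold):
            flagged.append(value)
        stats.update(value)
    return flagged


print(flag_outliers([10, 11, 9, 10, 12, 10, 100]))
```

Nothing is stored except three numbers, so the same logic works whether the stream has seven values or seven billion.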

Choosing the Right Tool

With so many streaming options out there, picking the right one depends on what you need. Some systems are better for simple tasks, while others can handle complex analytics or learn from data. The key is to know what you need and choose the tool that fits best.

Choosing the tool that best fits your needs in a world of diverging and converging technologies is the key catalyst of the Data Streaming Wars!

Data Streaming Wars

Decoding the Data Streaming Wars: The Battle to Process the World’s Data

The realm of data streaming has exploded into a battlefield, with technologies vying for dominance in how we process the ever-growing deluge of information. The so-called ‘Data Streaming Wars’ are not mere hype but a reflection of the shifting paradigms in data handling and processing necessitated by the rise of big data.

Why the War and Why Is It Heating Up?

The ‘war’ stems from a simple truth: the architecture of data streaming has evolved dramatically. Once a straightforward message-passing layer, it’s now an intricate ecosystem where every company is shifting from isolated products to comprehensive platforms. The race is on for supremacy in performance and resilience — providers are relentlessly innovating to deliver data faster and more reliably. The battle lines in the data streaming wars are drawn not over mere territory, but over the real-time pulses of global data — the lifeblood of modern enterprises. This clash is ignited by the voracious appetite for instant data insights, where every millisecond of delay can mean a missed opportunity.

The Catalyst for Conflict

Imagine a world that pulses with data — a digital heartbeat that races with every click, swipe, transaction, and sensor signal. In this world, data isn’t just an asset; it’s the currency of influence and power. Companies that can harness this relentless stream in real-time can predict trends, adapt strategies instantaneously, and outmaneuver competitors. But the catch? The technology to capture, analyze, and act on this data must be lightning-fast, scalable, and ever-adaptive. This high-stakes scenario is the powder keg for the data streaming wars.

A Race Against Time and Data

The battleground is everywhere — financial markets trading in microseconds, e-commerce platforms personalizing shopping experiences, and IoT devices coordinating global supply chains. Businesses demand the ability to process and glean insights from their data at breakneck speeds. Providers are locked in an arms race, relentlessly pushing the limits of technology to feed this need for speed. In this war, the victor claims more than market share; they become the architects of the future. The streaming platform that prevails shapes how we interact with technology, make decisions, and understand the world. It’s a war not just for profit, but for influence over the digital landscape of tomorrow.

Who’s Battling It Out?

The combatants are diverse, with seasoned veterans like IBM MQ and RabbitMQ facing off against modern juggernauts such as Kafka, Flink, and cloud giants’ offerings like AWS Kinesis, Azure Event Hubs, and GCP Pub/Sub. Each platform has its strengths, whether it’s Kafka’s robust community and ecosystem, AWS’s integrated suite of services, or Azure’s seamless integration with other Microsoft products.

Sustaining the Fight

To stay ahead, providers diversify their services, catering to everything from transactional data streams to media streaming and complex analytics. The move towards hybrid and multi-cloud offerings is also a strategic maneuver to cater to a broader market and ensure resilience against the rapidly changing tech landscape.

Navigating the Battlefield

For consumers, the key to navigating this war is to stay informed. Understanding the capabilities, trade-offs, and future directions of streaming platforms is essential. Architectural decisions made today must be agile enough to adapt to tomorrow’s changes — flexibility is the name of the game.

[Images for AWS, Azure, and GCP are not reproduced here.]

Looking Ahead

Data streaming today is a trifecta:

  1. Temporary Storage: Buffering data for multiple reads and replays.
  2. Real-Time Analytics: Processing data on the fly for immediate insights.
  3. Persistence: Ensuring data longevity by moving it to long-term storage.
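The three roles above can be sketched together in a toy example. Everything here is hypothetical: MiniStream, total_amount, and archive are names invented for illustration, the "long-term storage" is just an in-memory text buffer, and no real platform's API is being shown.

```python
import io
import json


class MiniStream:
    """Toy append-only log; reading never removes events (hypothetical)."""

    def __init__(self):
        self._log = []

    def publish(self, event):
        self._log.append(event)
        return len(self._log) - 1  # offset of the new event

    def read(self, offset=0):
        """Return events from `offset` onward, so any reader can replay."""
        return self._log[offset:]


def total_amount(events):
    """Real-time analytics stand-in: a simple aggregate over the events read."""
    return sum(event["amount"] for event in events)


def archive(events, sink):
    """Persistence stand-in: write events as JSON lines to a long-term sink."""
    for event in events:
        sink.write(json.dumps(event) + "\n")
    return len(events)


stream = MiniStream()
for amount in (5, 7, 11):
    stream.publish({"amount": amount})

# 1. Temporary storage: two independent reads see the same buffered events.
first_read = stream.read()
replay = stream.read()

# 2. Real-time analytics: aggregate over what has arrived so far.
running_total = total_amount(first_read)

# 3. Persistence: copy the events to a longer-lived sink.
sink = io.StringIO()  # stands in for object storage or a data lake
archived = archive(stream.read(), sink)
```

The point of the sketch is that the same buffered events serve all three purposes, which is why platforms differ mostly in how far they take each one.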

Each player in the streaming wars brings a unique approach to these functionalities, and choosing the right platform depends on a deep understanding of your data needs and strategic goals.

The data streaming wars are emblematic of our era’s technological evolution, reflecting the vital importance of data in modern business strategies. As we watch these titans clash, one thing is clear: the real winners of this war will be the businesses that leverage these platforms to turn real-time data into real-world value.


I learn by writing about data, AI, cloud, and technology. All views expressed here are my own and do not represent the views of the firm I work for.