Towards Data & Cloud #5: Data Ingestion Frameworks (Part 1)

Framework, Tools & Technology Focus

LAKSHMI VENKATESH
6 min read · Mar 19, 2024
Data Management Hype Cycle 2023

The Gartner Hype Cycle 2023 is a graphical representation of the life-cycle stages a technology goes through, from conception to maturity and widespread adoption. Gartner publishes several hype cycles; the Data Management hype cycle overlaps to some extent with the data architecture frameworks discussed below.

The journey starts at the “Innovation Trigger” phase, where new technologies make their debut, sparked by innovation and early publicity. Here, you can see technologies like Vector Databases and Generative AI for Data Management beginning their journey.

As we move up the curve to the “Peak of Inflated Expectations,” technologies experience a surge in popularity and high expectations due to media hype and early adopters’ interest. In this phase, we find concepts like Data Fabric, Ledger Databases, and Lakehouse, which are gaining significant attention and promise to solve a variety of data management challenges.

After the peak, technologies inevitably fall into the “Trough of Disillusionment,” where the initial excitement fades due to failed projects, technical challenges, or simply not meeting the high expectations. Technologies indicated to be in this phase include Augmented Data Catalogs and Knowledge Graphs.

Technologies that survive the trough eventually climb the “Slope of Enlightenment” as understanding and practical applications improve. This is where organizations start to learn how to effectively implement these technologies. Data Lake as a Service and Event Stream Processing in DBMS Analytics are examples of technologies on this upward slope.

Finally, we reach the “Plateau of Productivity,” where the benefits of the technology become widely demonstrated and accepted. Here, technologies are firmly established in the market. According to the Hype Cycle, technologies like SQL Interfaces to Object Stores and Multimodel DBMS are reaching this plateau, suggesting they are maturing and gaining mainstream adoption.

Data Architecture Frameworks related to Data Ingestion / Processing:

Data Architecture Frameworks in line with the Data Management Hype Cycle

Data Ingestion Frameworks:

Batch Architecture

Batch processing architecture is a data processing method where data is collected over a period and then processed all at once. This approach is useful for operations that don’t need immediate response times and can be scheduled to run at specific intervals, like nightly or weekly. Here’s a simplified breakdown:

  1. Data Collection: Data is gathered and stored until there’s enough to begin processing. This could be user data, transactions, logs, etc.
  2. Data Processing: Once enough data has been collected, it’s processed in a large, single batch. This might involve tasks like analysis, transformation, or aggregation.
  3. Data Storage: After processing, the data is stored in a database or data warehouse where it can be accessed for further use, such as reporting or insights generation.
  4. Scheduling: Batch jobs are typically scheduled to run during off-peak hours so they don’t interfere with live systems and make efficient use of computing resources.

Batch architecture is ideal for scenarios where it’s not critical to have real-time data processing and where processing large volumes of data at once can lead to more efficient use of resources.
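
To make this concrete, here is a minimal Python sketch of a batch job. The staging directory of JSON-lines files, the field names, and the SQLite database standing in for a warehouse are all illustrative assumptions, not any particular product’s API:

```python
import json
import sqlite3
from pathlib import Path

STAGING_DIR = Path("staging")   # hypothetical landing zone for accumulated files
WAREHOUSE_DB = "warehouse.db"   # SQLite standing in for a real data warehouse

def run_batch_job() -> None:
    # 1. Data collection: read every file accumulated since the last run.
    records = []
    for path in sorted(STAGING_DIR.glob("*.jsonl")):
        with path.open() as f:
            records.extend(json.loads(line) for line in f)

    # 2. Data processing: aggregate the whole batch in one pass.
    totals: dict[str, float] = {}
    for rec in records:
        totals[rec["user_id"]] = totals.get(rec["user_id"], 0) + rec["amount"]

    # 3. Data storage: persist results where reports can query them.
    with sqlite3.connect(WAREHOUSE_DB) as db:
        db.execute("CREATE TABLE IF NOT EXISTS daily_totals (user_id TEXT, total REAL)")
        db.executemany("INSERT INTO daily_totals VALUES (?, ?)", totals.items())

if __name__ == "__main__":
    # 4. Scheduling lives outside the code, e.g. a cron entry such as
    #    0 2 * * * python batch_job.py   (02:00, off-peak)
    run_batch_job()
```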

Streaming — Real-Time Architecture

Streaming or real-time data processing architecture handles data continuously as it arrives, rather than waiting to collect data before processing. This approach is essential for scenarios where immediate processing and action on data are required. Here’s a straightforward explanation:

  1. Data Ingestion: Data is ingested in real-time from various sources, such as sensors, logs, or user activities.
  2. Data Processing: As soon as data arrives, it’s processed immediately. This can involve filtering, aggregating, or analyzing data on-the-fly.
  3. Data Storage: Processed data can be stored for longer-term analysis or immediately acted upon. Some systems may only store data after it’s been processed, depending on the requirements.
  4. Decision Making: The immediate output from processed data can trigger actions, alerts, or decisions without delay.

Streaming architecture is crucial for applications that rely on the timely use of data, such as fraud detection, live recommendations, or monitoring systems, where responding quickly to new information is critical.
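
A minimal sketch of this flow using only Python’s standard library; the in-memory queue stands in for a real broker such as Kafka, and the fraud-style threshold is an invented example:

```python
import queue
import threading

events: queue.Queue = queue.Queue()  # stands in for a durable message broker

def process(event: dict) -> None:
    # 2. Data processing: act on each record the moment it arrives.
    if event["amount"] > 10_000:
        # 4. Decision making: trigger an alert without delay.
        print(f"ALERT: unusual activity on account {event['account']}")
    # 3. Data storage could happen here if processed events must be kept.

def consumer() -> None:
    while True:
        event = events.get()   # 1. Data ingestion: block until data arrives
        if event is None:      # sentinel used to end this demo
            break
        process(event)

worker = threading.Thread(target=consumer)
worker.start()

# Simulate a live source emitting events one at a time.
for amount in (120, 25_000, 300):
    events.put({"account": "A-42", "amount": amount})
events.put(None)
worker.join()
```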

Whether batch and streaming should remain two separate ingestion and processing architecture patterns, or be combined into a single integrated pattern to support Big Data processing, is a constant question.

Lambda Architecture


Lambda Architecture is designed to handle massive volumes of data by combining batch and real-time streaming methods, balancing latency, throughput, fault tolerance, and scalability. It supports both batch data (the single source of truth) and on-demand streaming data (multiple versions and sources). While the two streams of data are stored so that point-in-time data can be retrieved, their outputs can be joined for the presentation / consumption layer.

Lambda Architecture key principles:

  1. Immutability of Data: Emphasizing data’s unchanging nature, ensuring reliability and consistency.
  2. Data Denormalization: Simplifying access by integrating data in a unified format.
  3. Dual Model: Supporting both the streaming (multi-version) and batch (point-in-time) data models.
  4. Precomputed Views: Facilitating quicker data retrieval through pre-calculated results.

Exploring the Layers — Three-tiered structure:

  • The Batch Layer (Cold Path): This foundational layer focuses on comprehensive data processing, handling massive datasets with a focus on accuracy.
  • The Speed Layer (Hot Path): For real-time data needs, this layer provides immediate processing to deliver up-to-the-minute insights.
  • The Serving Layer: Acting as the bridge, this layer merges outputs from both the batch and speed layers to present a unified view (sketched below).
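
Below is a minimal, in-memory Python sketch of the three layers. The page-view counts and the Counter-based views are illustrative assumptions; in practice the cold path might run on an engine like Spark while the hot path runs on a stream processor:

```python
from collections import Counter

# Master dataset: the immutable, append-only record of all events so far.
master_log = [("pageviews", 3), ("pageviews", 5), ("signups", 1)]
recent_events = [("pageviews", 2)]   # arrived after the last batch run

def batch_layer(log):
    """Cold path: recompute an accurate view over the full dataset."""
    view = Counter()
    for metric, count in log:
        view[metric] += count
    return view

def speed_layer(events):
    """Hot path: process only the events the last batch run hasn't seen."""
    view = Counter()
    for metric, count in events:
        view[metric] += count
    return view

def serving_layer(batch_view, realtime_view):
    """Bridge: merge both views into the single answer consumers query."""
    return batch_view + realtime_view

print(serving_layer(batch_layer(master_log), speed_layer(recent_events)))
# Counter({'pageviews': 10, 'signups': 1})
```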

The Impact of Lambda Architecture

Adopting Lambda Architecture means embracing scalability: systems scale out smoothly to manage growing data loads. It’s a strategy that reduces latency, minimizes errors, and ultimately paves the way for a linearly scalable system that meets the demands of modern data processing challenges.

Kappa Architecture

In the landscape of Big Data, the answer to the search for efficient batch + real-time processing within a single architecture pattern is Kappa Architecture. Imagine a single, flowing river of data — this is the core of Kappa Architecture. It discards the traditional two-river approach of Lambda Architecture, which divides data into a fast-moving stream and a slow-moving lake (batch processing). Instead, Kappa keeps things simple with an append-only, immutable log that acts as the singular source of truth for all incoming data.


Kappa Architecture key principles:

  • No More Double Trouble: With Kappa, you avoid the complexity of managing two separate systems for data processing. There’s no need to duplicate logic between real-time and batch processes.
  • A Single Source: Think of the append-only log as a continuously updated ledger, providing a transparent, time-ordered sequence of records that are permanent and unchangeable.
  • Speedy Streams: By channeling data directly through a computational system, Kappa enables real-time processing with remarkable speed — turning what used to be a batch process into a swift flow of insights.

Once the data flows through the computational system, it’s ready to be served up. Kappa architecture makes use of auxiliary stores to provide accessible and queryable data, ensuring that the information you need is always at your fingertips, updated in real-time. Major players like LinkedIn and Yahoo have turned to Kappa Architecture to handle their massive data streams. They’ve found that the real-time nature of Kappa not only simplifies the data processing landscape but also delivers insights at the speed required for today’s data-driven decisions.
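
The idea is easiest to see in code. Here is a minimal Python sketch in which an in-memory list plays the role of the durable log and a plain dict plays the auxiliary store; both are illustrative assumptions:

```python
# The append-only log is the single source of truth.
log: list[dict] = []   # in production: a durable, partitioned log such as Kafka

def append(event: dict) -> None:
    log.append(event)  # records are permanent, time-ordered, never updated

def build_view(from_offset: int = 0) -> dict:
    """One stream-processing code path serves every need:
    'batch' reprocessing is simply replaying the log from offset 0."""
    view: dict = {}
    for event in log[from_offset:]:
        view[event["user"]] = view.get(event["user"], 0) + event["amount"]
    return view

append({"user": "alice", "amount": 10})
append({"user": "alice", "amount": 5})

serving_store = build_view()   # auxiliary, queryable store
print(serving_store)           # {'alice': 15}

# Need a bug fix or a new metric? Change build_view and replay from
# offset 0 -- there is no separate batch pipeline to keep in sync.
```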

Kappa Architecture marks a significant step towards real-time data processing simplicity. By focusing on a streamlined path, it offers an approachable, maintainable, and efficient framework for businesses that thrive on immediate data-driven insights. As we move further into an era where speed and simplicity are king, Kappa Architecture stands out as the modern knight of Big Data.

Continued in Part 2 — Link


LAKSHMI VENKATESH

I learn by Writing; Data, AI, Cloud and Technology. All the views expressed here are my own and do not represent the views of the firm I work for.