This article is part of the multi-part series Data Technology Trends (parent article). Previous part: Data Technology Trend #2: Strategic (Part 1). Next part: Data Technology Trend #4: Decentralized.
Trend #3.1: Optimization of Data. Long Live Streams (Realtime) and Dead ETL(?)
The idea is to remove latency, which can be done in multiple ways. Today, cloud streaming services (be it Kinesis or MSK) can process petabytes of data. And with the capabilities of modern query engines, ETL is arguably long dead and ELT is the way to go: the transformation happens directly while querying, unless you want to store the transformed data separately.
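As a rough illustration of "transformation while querying", the sketch below loads raw rows untouched and applies the transformation in a view that runs only at query time. It uses Python's built-in sqlite3 as a stand-in for a cloud query engine, and the table, view, and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a cloud query engine (an assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Extract + Load: store the raw records as-is, every column as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
raw_rows = [
    ("1", "10.50", "us"),
    ("2", "4.25", "US"),
    ("3", "7.00", "de"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: defined as a view, so nothing is materialized up front.
conn.execute(
    """
    CREATE VIEW orders_clean AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
    """
)


def revenue_by_country(connection):
    """Aggregate over the view; casting and upper-casing happen at query time."""
    query = (
        "SELECT country, SUM(amount) FROM orders_clean "
        "GROUP BY country ORDER BY country"
    )
    return connection.execute(query).fetchall()


print(revenue_by_country(conn))
```

If the transformed result is needed repeatedly, the same SELECT could instead populate a separate table, which corresponds to storing the transformations separately.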
A streaming or messaging system transfers data from a producer to a consumer. There are two key types: (1) the point-to-point messaging system and (2) the publish-subscribe messaging system. As with the Modern Data Platform, Modern Data Streaming is gaining traction for both real-time and non-real-time workloads.
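The difference between the two messaging types can be sketched in a few lines of plain Python. This is an in-memory toy, not a real broker, and the class and variable names are hypothetical: in point-to-point delivery a message goes to exactly one receiver, while in publish-subscribe every subscriber gets its own copy.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one receiver, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Returns None when no message is waiting.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Each published message is copied to every current subscriber."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, name):
        # The returned list acts as this subscriber's inbox.
        self._subscribers[name] = []
        return self._subscribers[name]

    def publish(self, message):
        for inbox in self._subscribers.values():
            inbox.append(message)


queue = PointToPointQueue()
queue.send("order-1")
first = queue.receive()   # the first receiver gets the message
second = queue.receive()  # nothing is left for a second receiver

topic = PubSubTopic()
billing = topic.subscribe("billing")
analytics = topic.subscribe("analytics")
topic.publish("order-1")  # both subscribers receive their own copy
```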
Streaming — Apache Kafka:
The Kafka message broker architecture, or the general Kafka architecture, has only three key elements: the producer, the topic, and the consumer. The producer generates data and publishes it to a topic, and consumers consume from whichever topics they are interested in. Applications of all kinds (ML, AI, BI, SQL and NoSQL databases, and so on) can consume directly from Apache Kafka.
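The producer, topic, and consumer roles can be mimicked with a small in-memory model. This is only a sketch under simplifying assumptions: a real Kafka cluster adds brokers, partitions, and consumer groups, and none of the class names below come from a Kafka client library. It models a topic as an append-only log and lets each consumer track its own read position, so several consumers can read the same data independently.

```python
class Topic:
    """Append-only log of messages (stand-in for a single-partition topic)."""

    def __init__(self, name):
        self.name = name
        self.log = []


class Producer:
    def publish(self, topic, message):
        topic.log.append(message)
        return len(topic.log) - 1  # position (offset) of the new message


class Consumer:
    def __init__(self, name):
        self.name = name
        self.offsets = {}  # topic name -> next offset to read

    def poll(self, topic):
        # Return everything published since this consumer's last poll.
        start = self.offsets.get(topic.name, 0)
        messages = topic.log[start:]
        self.offsets[topic.name] = len(topic.log)
        return messages


clicks = Topic("clicks")
producer = Producer()
producer.publish(clicks, {"user": "a", "page": "/home"})
producer.publish(clicks, {"user": "b", "page": "/cart"})

bi_dashboard = Consumer("bi")
ml_pipeline = Consumer("ml")

bi_first = bi_dashboard.poll(clicks)   # the two messages published so far
producer.publish(clicks, {"user": "a", "page": "/checkout"})
bi_second = bi_dashboard.poll(clicks)  # only the new message
ml_all = ml_pipeline.poll(clicks)      # an independent reader sees all three
```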
Kafka scores well on scalability, elasticity, and flexibility. If it is not designed carefully, however, the setup can become convoluted and extremely complex, with multiple queues in place, and it will rival a first-generation architecture in complexity instead of being a modern one.
Managed Kafka Services (MSK) — AWS MSK:
Data Governance is part of security, and it is where security meets data discovery:
- Data Quality
- Data Catalog
- Data Lineage
- Self Service platform
- Data Policies
More on this will be discussed in Trend #8.
ETL — Glue:
Glue is a fully managed ETL service based on Spark; in effect, it is serverless Spark for big data processing. It is designed to work with structured and semi-structured data, and it lets you discover, transform, and query data. Beyond crawling and cataloging, creating event-driven ETL pipelines with Glue is an important and effective way to read data as soon as it becomes available on S3. Creating the pipeline in Glue and writing the transformation logic in Python enables faster processing. You can also register the data with the Glue catalog for metadata management; the files can be crawled to collect this metadata.
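One common way to make such a pipeline event-driven is to have an S3 event notification invoke a small function that starts a Glue job for each new object. The sketch below assumes that setup. The `start_job_run` call with `JobName` and `Arguments` is the boto3 Glue client method, while the job name, argument names, and bucket are hypothetical, and a fake client is used so the example runs locally without AWS.

```python
from urllib.parse import unquote_plus


def start_glue_runs(event, glue_client, job_name):
    """Start one Glue job run per object listed in an S3 event notification."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            # Custom job arguments; these names are hypothetical.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


class FakeGlueClient:
    """Local stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def start_job_run(self, JobName, Arguments):
        self.calls.append((JobName, Arguments))
        return {"JobRunId": f"jr_{len(self.calls)}"}


# Trimmed-down S3 event with only the fields the function reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-data"},
                "object": {"key": "sales/2023+report.csv"}}}
    ]
}
fake_glue = FakeGlueClient()
run_ids = start_glue_runs(sample_event, fake_glue, "clean-sales-job")
```

In an actual Lambda function, the fake would be replaced with `boto3.client("glue")`, and the Glue job itself would read the two arguments to locate the new file.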
Trend #3.2: Augmented Data Management
1. Database systems
2. Master Data and Metadata management
3. Quality control
4. Integration definition
Augmented Data Management, meaning ML-based data management, is gaining traction.