Data Technology Trend #0: Foundational (Part 1)
This article is a part of a multi-part series Data Technology Trends (parent article).
Trend #0 is the foundational phase: data technology has matured, is used in the mainstream (many technologies that were presumed dead are back in business), and has reached the plateau of productivity (Gartner’s term), thanks to modern cloud data platforms.
Big Data with Spark, Data Lakes, and AI (Artificial Intelligence / Machine Learning) are now mainstream and quite mature.
Data Ethics, Data Privacy, Data Governance, and Security have gained more traction and moved towards maturity in recent years. Finance has always had Data Governance. Now, with the growth of social media data and the like (and as we enter the Digital Twin era), the data security and governance that were most prominent in Finance and Health Care apply much more widely, and this is the new normal.
Trend #0.1: Abandoning / Migrating from Hadoop to Spark and Modern Cloud Data Platforms.
Hadoop has not been eliminated. For extremely large and complex data sets, where Spark’s in-memory model may not be able to hold the data needed for complex processing and calculations, Hadoop is still the way to go and keeps its place in big data processing. However, with the advancement of Spark and Databricks, the adoption of Big Data is improving, and organizations and projects that were using Hadoop are, where possible, abandoning it for Spark and Databricks and shifting towards modern cloud platforms.
Taking a step back: before the advent of Big Data there were many efforts at standardization, with the 360, the streamlining of Online Transaction Processing, Data Warehousing, the entry of Teradata and MPP processing, etc. Then came Hadoop and Big Data. With IBM, Hadoop, Cloudera, and Hortonworks, the world was moving towards the Big Data era, yet the majority of Big Data work was in the hands of Java developers, with only a few tools such as Pig and Hive in the hands of actual data folks. With Spark, the entire Big Data stack was revived, and with the current modern cloud data platforms Big Data has got a complete second life. Spark has become the first choice for Big Data applications, and organizations are either abandoning Hadoop or migrating from it to Spark and modern cloud data platforms.
Spark provides a complete ecosystem for both streaming and batch data processing, along with the ability to perform analytics (machine learning) on the data. Compared to the Hadoop environment and its complex map-reduce jobs, modern engines build on the map-reduce model and are memory-optimized to deliver on the promise of up to 100X faster processing.
Spark was among the first unified analytics engines to facilitate data processing, SQL analytics, and ML.
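To make the map-reduce model mentioned above concrete, here is a minimal pure-Python sketch of a word count. It is not Spark or Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase) and the sample lines are hypothetical, and keeping the intermediate pairs in a Python list only stands in for the idea of reusing a result held in memory instead of re-reading it from disk.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each key."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Hypothetical sample input, standing in for lines read from a large file.
lines = ["spark and hadoop", "spark in memory", "hadoop on disk"]

# The intermediate result stays in memory after the map step.
pairs = map_phase(lines)
word_counts = reduce_phase(shuffle_phase(pairs))

# A second computation reuses the in-memory pairs instead of re-reading the input.
total_words = sum(count for _, count in pairs)

print(word_counts)
print(total_words)
```

In a real engine the three phases run in parallel across many machines. The point of the sketch is only the shape of the computation and the reuse of an intermediate result.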
In terms of speed, cost, ease of use, caching, ease of access, and data protection, Spark has the advantage and enables quick adoption. With the Spark ecosystem available on most modern cloud platforms such as AWS and Azure, adoption of both serverless (e.g., AWS Glue Spark) and server-based (e.g., AWS EMR) options is becoming increasingly widespread.
Modern Cloud Data Platform:
Modern Cloud Data platforms are taking organizations for a spin!
1. Delta Lake by Databricks:
Customers are increasingly migrating to modern cloud data platforms such as Databricks and are experiencing up to 50% performance improvement in runtime, 40% lower infrastructure costs, 200% data processing throughput, a much more secure environment, etc. Source.
It is also becoming increasingly easy to manage the Big Data and Analytics change management process.
Delta Lake by Databricks: reliable data lakes at scale. It is built on the lakehouse architecture and is growing into one unified platform for Data and AI. Spark shifted the focus away from Hadoop and HDFS and made the use of Big Data mainstream. The next apparent step is the integration of Big Data and AI, for which Databricks already provides a unified solution, and it is only a matter of time before it becomes mainstream. Delta Lake promises to combine the performance of a Data Warehouse with the flexibility of a Data Lake. More on this in Trend #8.
HeatWave is Oracle’s new integrated, high-performance analytics engine for MySQL Database Service. It reduces the distinction between OLAP and OLTP and enables running both workloads directly from the MySQL database, eliminating the need for complex data moves, re-computation, etc. There is no need for a separate analytics database. This service has been introduced in Oracle Cloud Infrastructure (OCI).
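The idea of running transactional and analytical workloads against the same database can be sketched as follows. This is only an illustration under loud assumptions: it uses Python's built-in sqlite3 module as a stand-in, not MySQL or HeatWave, and the orders table and its values are hypothetical. It shows the pattern (write, then aggregate in place, with no export step), not HeatWave's performance characteristics.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a single database
# that serves both workloads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# Transactional (OLTP-style) writes: a small batch committed atomically.
with conn:
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("east", 120.0), ("west", 80.0), ("east", 40.0)],
    )

# Analytical (OLAP-style) aggregate on the same table, with no copy into a
# separate analytics database.
revenue_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)

print(revenue_by_region)
conn.close()
```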
While the Oracle Cloud platform, MongoDB Atlas, Cosmos DB by Microsoft, the AWS Data and Big Data platform, the Cloudera Data Platform, etc. are without question best of breed, I see Databricks Delta Lake and Snowflake as the most promising next-gen cloud data platforms.
4. Redshift AQUA:
Redshift AQUA (Advanced Query Accelerator) is a new distributed, hardware-accelerated cache that, according to Amazon, lets Redshift run up to 10 times faster than other cloud data warehouses. In brief:
- it optimally boosts certain types of queries
- it is included with Redshift RA3 (which separates storage and compute) xl and 16xl node types at no extra cost
What problem does it try to solve?
- CPU in-memory processing: CPU in-memory processing bottlenecks are currently handled with a separate cache and other techniques. But as storage continues to grow and customers have 7 layers of sub-customers with several million pieces of information being processed, these bottlenecks undermine the ability to provide an optimally priced solution.
- Network bandwidth: modern cloud data warehouses are sophisticated, yet network bandwidth remains a bottleneck, and multiple cloud providers such as Databricks are trying to solve it.
Trend #0.2: Baby Big Data / Data Lakes
Instead of constructing different architectures for intra-day processing, batch processing, and Big Data, the trend is to start with Baby Big Data and keep expanding the Big Data technology landscape as the volume of data consumed grows. Scaling from megabytes to exabytes then becomes easier. This is an ideal approach especially for organizations that are just starting up.
As you move towards a Data Lake First architecture, building data platforms and applications on top of it becomes much easier and more efficient.
Summary: Big Data with Spark, Data Lakes, and AI (Artificial Intelligence / Machine Learning) are now mainstream and quite mature. This has brought “Data” back to business, and “Data is the new Oil”, which was only on paper until recent years, is the new reality now.
Wholesome data profiling for an organization is now possible, making it feasible to get the maximum and most advanced use of data that fits the bill. Ride the wave of modern cloud data platforms!
For other articles refer to luxananda.medium.com.