Data Technology Trend #2: Strategic (Part 4)

This article is part of the multi-part series Data Technology Trends (parent article). The previous article is Data Technology Trend #1: Trusted, and the next is Data Technology Trend #3: Accelerated.

Trend #2.5: Data management (preparation & integration) tools

Data Management:

According to the Gartner Hype Cycle for Data Management (2019), the top data management technologies are listed below.

Source: Gartner Hype Cycle for Data Management, 2019

The following technologies are worth mentioning:

Data Hub Strategy

Already discussed.

Data Catalog

Already discussed.

Data Classification

By default, all of an organization's data points should be treated as sensitive and confidential. Typical classification categories are:

  • Public
  • Confidential
  • Sensitive
  • Personal

Classification levels:

C1: Contact information or PII, including name, address, telephone number, and e-mail address.

C2: Identity data, including gender and date of birth.

C3: Communication data between you and us, including recordings of calls to our service centers, e-mail communication, online chats, comments, and reviews collected through surveys or posted on our channels and on social media platforms.

C4: Digital information data collected when you visit our websites, applications, or other digital platforms, including IP addresses, browser data, traffic data, social media behavior, and user patterns. If you subscribe to our newsletters, we may collect data regarding which newsletters you open, your location when opening them and whether you access any links inserted in the newsletters.
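As a rough illustration of how the C1 to C4 levels above could be applied, the sketch below tags the fields of a record with a level. The field names, the mapping, and the fallback label are all hypothetical assumptions made for this example; a real classification scheme would come from the organization's own data policy.

```python
# Hypothetical mapping from field names to the C1-C4 levels described above.
# The field names are illustrative assumptions, not a real schema.
FIELD_LEVELS = {
    "name": "C1",
    "address": "C1",
    "telephone": "C1",
    "email": "C1",
    "gender": "C2",
    "date_of_birth": "C2",
    "call_recording": "C3",
    "chat_transcript": "C3",
    "ip_address": "C4",
    "browser": "C4",
}

# Unknown fields fall back to this label, mirroring the rule that data is
# treated as sensitive and confidential by default.
DEFAULT_LEVEL = "UNCLASSIFIED-SENSITIVE"


def classify_record(record: dict) -> dict:
    """Return a mapping of each field in the record to its classification level."""
    return {field: FIELD_LEVELS.get(field, DEFAULT_LEVEL) for field in record}


record = {
    "name": "Ada",
    "date_of_birth": "1990-01-01",
    "ip_address": "203.0.113.7",
    "loyalty_tier": "gold",  # not in the mapping, so it gets the default label
}
levels = classify_record(record)
print(levels)
```

Falling back to a restrictive default means a newly added field is never treated as public by accident.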

DataOps

This will be discussed in the Democratization of data.

Data Fabric

This will be discussed in the Decentralization of data.

Augmented Data Management

Data Preparation

1. gather data

2. discover and assess data

3. cleanse and validate data

4. transform and enrich data

5. store data

Front runners: Alteryx APA platform, Power BI, Tableau Server, etc.

Though these are BI solutions, they are also widely used for data preparation.
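The five data preparation steps listed above can be sketched in plain Python. This is only a minimal illustration: the CSV content, the column names, and the cleansing rules (drop rows without an amount, drop exact duplicates) are assumptions made for the example, and the tools named above do far more.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from files, APIs, or databases.
RAW_CSV = """customer_id,country,amount
1,us,10.50
2,DE,
3,us,7.25
3,us,7.25
"""


def gather(text):
    """Step 1: gather data (here, parse CSV text into dict rows)."""
    return list(csv.DictReader(io.StringIO(text)))


def assess(rows):
    """Step 2: discover and assess (count missing values per column)."""
    missing = {}
    for row in rows:
        for col, val in row.items():
            if val == "":
                missing[col] = missing.get(col, 0) + 1
    return missing


def cleanse(rows):
    """Step 3: cleanse and validate (drop rows without an amount and exact duplicates)."""
    seen, clean = set(), []
    for row in rows:
        key = tuple(row.items())
        if row["amount"] == "" or key in seen:
            continue
        seen.add(key)
        clean.append(row)
    return clean


def transform(rows):
    """Step 4: transform and enrich (typed values, normalized country code)."""
    return [
        {
            "customer_id": int(r["customer_id"]),
            "country": r["country"].upper(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]


def store(rows, target):
    """Step 5: store (append to a list standing in for a destination table)."""
    target.extend(rows)
    return target


rows = gather(RAW_CSV)
missing = assess(rows)
prepared = store(transform(cleanse(rows)), [])
print(missing, prepared)
```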

Metadata Management solutions

Metadata management solutions deliver insights from data stored across the enterprise environment. They make it possible to search, locate, and manage the information the organization needs. This in turn leads to better data governance and creates opportunities for advanced and enhanced analytics. Metadata management solutions include data catalogs, tables, and other visual tools for processing information.

A sample of 10 leading metadata management software vendors:

10. Informatica

9. IBM

8. Alation

7. ASG Technologies

6. Collibra

5. Infogix

4. Octopai

3. Alex Solutions

2. Smartlogic

1. Erwin

Multimodel DBMS

1. A Multi-model database is a database that can store, index, and query data in more than one model.

2. Most databases support only one model, such as relational, document, graph, or triplestore. A database that combines several of these is multi-model.
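To make the idea concrete, here is a toy in-memory sketch in which the same stored records can be reached through key-value, document, and graph style access. The class and method names are hypothetical and do not correspond to any product's API; real multi-model databases add indexing, query languages, and persistence on top of this idea.

```python
class TinyMultiModelStore:
    """Toy in-memory store exposing key-value, document, and graph access."""

    def __init__(self):
        self._docs = {}   # key -> document (dict)
        self._edges = {}  # key -> set of (label, target_key)

    # Key-value model: opaque get/put by key.
    def put(self, key, doc):
        self._docs[key] = doc

    def get(self, key):
        return self._docs.get(key)

    # Document model: query by a field inside the stored documents.
    def find(self, field, value):
        return [k for k, d in self._docs.items() if d.get(field) == value]

    # Graph model: labelled edges between stored keys.
    def link(self, src, label, dst):
        self._edges.setdefault(src, set()).add((label, dst))

    def neighbors(self, src, label):
        return sorted(dst for lbl, dst in self._edges.get(src, set()) if lbl == label)


mm = TinyMultiModelStore()
mm.put("u1", {"name": "Ada", "city": "London"})
mm.put("u2", {"name": "Lin", "city": "London"})
mm.put("u3", {"name": "Sam", "city": "Oslo"})
mm.link("u1", "follows", "u2")
mm.link("u1", "follows", "u3")

print(mm.get("u3"))                   # key-value lookup
print(mm.find("city", "London"))      # document-style query
print(mm.neighbors("u1", "follows"))  # graph-style traversal
```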

Multi-model databases include, but are not limited to:

AllegroGraph: document (JSON, JSON-LD), graph

ArangoDB: document (JSON), graph, key-value

Cosmos DB: document (JSON), graph, key-value, SQL

Couchbase: document (JSON), key-value, N1QL

Datastax: key-value, tabular, graph

EnterpriseDB: document (XML and JSON), key-value

MarkLogic: document (XML and JSON), graph triplestore, binary, SQL

MongoDB: document (XML and JSON), graph, key-value, time-series

Oracle Database: relational, document (JSON and XML), graph triplestore, property graph, key-value, objects

OrientDB: document (JSON), graph, key-value, reactive, SQL

Redis: key-value, document (JSON), property graph, streaming, time-series

SAP HANA: relational, document (JSON), graph, streaming

Virtuoso Universal Server: relational, document (XML), RDF graphs

CosmosDB, Data Lake & Delta Lake.

Graph DBMS

1. A graph database is designed to treat the relationships between data as equally important to the data itself.

2. The intention is to hold data without constraining it to a pre-defined model.

E.g., Neo4j, Neptune, etc.
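The sketch below illustrates the point about relationships being first-class: edges are stored as explicit (source, relationship, target) triples and queried directly. The node names and relationship types are made up for the example, and the breadth-first search is a generic technique, not how Neo4j or Neptune are implemented internally.

```python
from collections import deque

# Nodes carry properties; edges are explicit (source, relationship, target) triples.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "carol": {"kind": "person"},
    "acme": {"kind": "company"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("carol", "WORKS_AT", "acme"),
]


def neighbors(node, relationship=None):
    """Targets reachable from `node` over one outgoing edge, optionally filtered by type."""
    return [
        dst
        for src, rel, dst in edges
        if src == node and (relationship is None or rel == relationship)
    ]


def shortest_path(start, goal):
    """Breadth-first search over outgoing edges; returns a list of nodes or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path("alice", "acme")
print(path)
```

New relationship types can be added by appending triples; no table or schema has to be altered first.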

Application Data Management

1. Application Data Management (ADM) is a technology-enabled discipline designed to help users govern and manage data in business applications such as ERP and financial applications.

2. ADM is critical for digital transformation and other modernization initiatives.

3. Today, ADM has emerged as a way to move beyond master data management to standardize and govern broader application data, and vendors such as Winshuttle are active in this space.

Blockchain

Already discussed.

Data Lakes

Already discussed

Master Data Management

1. Master Data Management (MDM) is a technology-enabled discipline that ensures the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise’s official shared master data assets.

2. A “Single Version of Truth” is the core concept, and it is difficult to achieve without proper architecture and stakeholder acceptance.

In-DBMS Analytics

1. An in-DBMS Analytics system contains an EDW (Enterprise Data Warehouse) integrated with an Analytic database platform.

2. It is mainly used for applications that require intensive processing. Once the datasets are gathered in data marts, this technology facilitates and secures data analysis, processing, and retrieval.

3. Key benefits: it streamlines the identification of future business opportunities and risks, improves the organization's predictive analytics capability, and provides ad hoc analytics reporting.
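The core idea, running the analysis where the data lives instead of exporting it first, can be sketched with Python's built-in sqlite3 module. SQLite is used here only because it ships with Python; it is not an enterprise data warehouse, and the table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 300.0), ("south", 50.0), ("south", 150.0)],
)

# The aggregation runs inside the database engine; only the small
# result set crosses into the application.
in_db = conn.execute(
    "SELECT region, COUNT(*), AVG(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# For contrast: pull every row out and aggregate in application code.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    count, total = totals.get(region, (0, 0.0))
    totals[region] = (count + 1, total + amount)
in_app = [(r, c, t / c) for r, (c, t) in sorted(totals.items())]

print(in_db)
print(in_app)
```

Both approaches give the same answer; the difference is how much data has to leave the database to get it, which matters more as tables grow.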

Logical Data Warehouse

1. An LDW (Logical Data Warehouse) is an architectural layer that sits on top of the data warehouse and allows data to be viewed without transformation or movement.

2. It allows analysts and other business users to access data without reformatting it, and eliminates the need to transform and consolidate data from disparate sources in order to view it.

3. It provides a more holistic view of an organization’s data at any point in time, regardless of where that data resides.

Wide-Column DBMSs

1. It is a NoSQL database.

2. Wide-column stores vs. columnar databases:

a. Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the strict sense, as they do not use columnar storage.

b. In a true columnar database, each column is stored separately on disk.

c. Wide-column stores often support the notion of column families, which are stored separately from one another.

d. Within a given column family, all data is stored in a row-by-row fashion.
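The column-family layout described above can be pictured with nested dictionaries: each family is kept apart, and inside a family the data is organized row by row. This is a conceptual sketch only; the family names, row keys, and helper functions are hypothetical and say nothing about the on-disk format of Cassandra, HBase, or Bigtable.

```python
# column family -> row key -> {column name: value}
# Each family is a separate structure; within a family, data is kept per row.
families = {"profile": {}, "activity": {}}


def put(family, row_key, column, value):
    """Write one cell into the given column family."""
    families[family].setdefault(row_key, {})[column] = value


def get_row(family, row_key):
    """Read one row from one family; other families are not touched."""
    return families[family].get(row_key, {})


# Rows in the same family may carry different columns (the "wide" part).
put("profile", "user#1", "name", "Ada")
put("profile", "user#1", "email", "ada@example.com")
put("profile", "user#2", "name", "Lin")
put("activity", "user#1", "last_login", "2021-03-01")
put("activity", "user#2", "last_login", "2021-03-02")
put("activity", "user#2", "page_views", 42)

print(get_row("profile", "user#1"))
print(get_row("activity", "user#2"))
```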

Examples:

- Amazon DynamoDB

- Apache Accumulo

- Apache Cassandra

- Apache HBase

- DataStax Enterprise

- DataStax Luna

- DataStax Astra

- Azure Tables

- Bigtable

- Hypertable

- MapR-DB

- ScyllaDB

- Sqrrl

- ClickHouse

Document Store DBMSs

1. A document store is a type of non-relational database designed to store and query data as JSON-like documents.

2. It makes it easier for developers to store and query data by using the same document-model format they use in their application code.

3. Documents are flexible, semi-structured, and hierarchical in nature.

4. This allows the data model to evolve with the application.

5. It enables flexible indexing, powerful ad hoc queries, and analytics over a collection of documents.

E.g., MongoDB.
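A small sketch of the document model in plain Python follows. The collection, the field names, and the dotted-path helper are assumptions made for illustration; this is not MongoDB's query API, only the underlying idea of querying nested, schema-flexible documents.

```python
import json

collection = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}, "tags": ["admin"]},
    {"_id": 2, "name": "Lin", "address": {"city": "Oslo"}},
    # A later document adds a new field without any schema migration.
    {"_id": 3, "name": "Sam", "address": {"city": "London"}, "loyalty": {"tier": "gold"}},
]


def get_path(doc, dotted):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, dotted, value):
    """Return the documents whose value at the dotted path equals `value`."""
    return [d for d in docs if get_path(d, dotted) == value]


londoners = [d["_id"] for d in find(collection, "address.city", "London")]
gold = [d["name"] for d in find(collection, "loyalty.tier", "gold")]

# Documents serialize directly to JSON, the same shape the application uses.
as_json = json.dumps(collection[2], sort_keys=True)
print(londoners, gold, as_json)
```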

Operational In-Memory DBMS

1. An in-memory database management system (IMDBMS) is a database management system (DBMS) that predominantly relies on main memory for data storage, management and manipulation.

2. This eliminates the latency and overhead of hard disk storage and reduces the instruction set that’s required to access data.
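As a small illustration of the idea, Python's built-in sqlite3 module can hold a database entirely in RAM. SQLite is an embedded engine rather than an operational in-memory DBMS product, so treat this only as a sketch of the concept; the table and values are invented.

```python
import sqlite3

# ":memory:" asks SQLite for a database held entirely in RAM (no file on disk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (user_id INTEGER PRIMARY KEY, token TEXT)")
conn.executemany("INSERT INTO sessions VALUES (?, ?)", [(1, "abc"), (2, "def")])
token = conn.execute(
    "SELECT token FROM sessions WHERE user_id = ?", (2,)
).fetchone()[0]
conn.close()

# A fresh in-memory connection starts empty: nothing was persisted.
fresh = sqlite3.connect(":memory:")
tables = fresh.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
fresh.close()

print(token, tables)
```

The second half shows the trade-off: without a persistence or replication mechanism, data held only in memory is lost when the process ends.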

Data Integration Tool

1. Data integration is the process of bringing data from different sources into a single destination; a data integration tool is the software that performs it.

2. Once the data is gathered in a single destination, meaningful insights can be drawn from it. The tool should integrate the collected data so that it is comprehensive, reliable, correct, and current.

3. Organizations should then be able to rely on the integrated data for business analysis and reporting.

4. Types of Data integration tools:

a. On-premise data integration tools

b. Cloud-based data integration tools

c. Open-source data integration tools

d. Proprietary data integration tools

Examples: Pentaho, Informatica PowerCenter, Talend, Hevo Data, etc.
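To show what "bringing data from different sources into a single destination" means in the smallest possible form, the sketch below merges a CSV source and a JSON source into one list of records. The two sources, their field names, and the merge rule are hypothetical; the tools listed above handle scheduling, connectors, and error handling that this example leaves out.

```python
import csv
import io
import json

# Two hypothetical sources in different formats.
CRM_CSV = "customer_id,name\n1,Ada\n2,Lin\n"
BILLING_JSON = '[{"customer": 1, "balance": 20.0}, {"customer": 3, "balance": 5.5}]'


def integrate(crm_csv, billing_json):
    """Merge both sources into one destination keyed by customer id."""
    destination = {}
    for row in csv.DictReader(io.StringIO(crm_csv)):
        cid = int(row["customer_id"])
        destination[cid] = {"customer_id": cid, "name": row["name"], "balance": None}
    for item in json.loads(billing_json):
        cid = item["customer"]
        # Customers known only to billing still get a record, with name left empty.
        record = destination.setdefault(
            cid, {"customer_id": cid, "name": None, "balance": None}
        )
        record["balance"] = item["balance"]
    return [destination[cid] for cid in sorted(destination)]


unified = integrate(CRM_CSV, BILLING_JSON)
print(unified)
```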

Analytical In-Memory DBMS

Refer whitepaper

Data Encryption

Encryption at rest: KMS

Encryption in transit: TLS / SSL

TDE: Transparent Data Encryption

Symmetric and asymmetric encryption
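For the in-transit item, here is a minimal sketch of a client-side TLS configuration using Python's standard ssl module. It only builds the context and makes no network connection; the choice of TLS 1.2 as the minimum version is an assumption for the example, not a recommendation from the sources above.

```python
import ssl

# Client-side context; certificate and hostname verification are on by default.
context = ssl.create_default_context()

# Refuse legacy SSL and early TLS versions (assumed policy for this example).
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.verify_mode, context.check_hostname, context.minimum_version)
```

A context like this would then be passed to whatever client opens the connection, so that the data is encrypted while it travels between systems.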

Data Virtualization

- Virtual views of the data

- No data is physically moved

- often the same as federated data

- Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data

- Unlike the traditional extract, transform, load (ETL) process, the data remains in place, and real-time access is given to the source system for the data.

- This reduces the risk of data errors and the workload of moving data around that may never be used. The concept and its software are a subset of data integration and are commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.

- Some enterprise landscapes are filled with disparate data sources including multiple data warehouses, data marts, and/or data lakes, even though a Data Warehouse, if implemented correctly, should be unique and a single source of truth.

- Data virtualization can efficiently bridge data across data warehouses, data marts, and data lakes without having to create a whole new integrated physical data platform.
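The points above can be sketched as a virtual view that stores nothing itself and resolves every query against the sources at read time. The two sources (an SQLite table and a dictionary standing in for a remote service) and the class name are hypothetical; commercial data virtualization platforms add query optimization, caching, and security on top of this idea.

```python
import sqlite3

# Two hypothetical source systems that stay where they are.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 30.0), (1, 12.0), (2, 8.0)]
)

crm_api = {1: {"name": "Ada"}, 2: {"name": "Lin"}}  # stands in for a remote service


class VirtualCustomerView:
    """Virtual view: holds no data, resolves each query against the sources."""

    def __init__(self, db, crm):
        self._db = db
        self._crm = crm

    def customer_totals(self):
        query = (
            "SELECT customer_id, SUM(total) FROM orders "
            "GROUP BY customer_id ORDER BY customer_id"
        )
        for customer_id, total in self._db.execute(query):
            yield {"name": self._crm[customer_id]["name"], "total": total}


view = VirtualCustomerView(orders_db, crm_api)
before = list(view.customer_totals())

# A change in a source is visible on the next read because nothing was copied.
orders_db.execute("INSERT INTO orders VALUES (2, 2.0)")
after = list(view.customer_totals())

print(before)
print(after)
```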

In-Memory Data Grids

1. An in-memory data grid (IMDG) is a set of networked/clustered computers that pool together their random access memory (RAM) to let applications share data with other applications running in the cluster.

2. Though IMDGs are sometimes generically described as a distributed in-memory data store, IMDGs offer more than just storage.

3. IMDGs are built for data processing at extremely high speeds. They are designed for building and running large-scale applications that need more RAM than is typically available in a single computer server.

4. This enables the highest application performance by using RAM along with the processing power of multiple computers that run tasks in parallel. IMDGs are especially valuable for applications that do extensive parallel processing on large data sets.

5. IMDGs are performant: moving from a traditional, disk-based approach to an in-memory one can improve speed by 100x or even 1000x.
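The two ideas in the list above, pooling memory across machines and processing partitions in parallel, can be imitated on a single machine. In this toy sketch each "node" is just a dictionary and the workers are threads, so it shows the partitioning logic only and gives no real speedup; the class and method names are hypothetical and do not match any IMDG product.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class TinyDataGrid:
    """Toy grid: each 'node' is a dict standing in for one machine's RAM."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # Stable hash so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

    def map_reduce(self, map_fn, reduce_fn):
        # Each node processes its own partition; partial results are then combined.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(
                pool.map(lambda node: [map_fn(v) for v in node.values()], self.nodes)
            )
        return reduce_fn(x for part in partials for x in part)


grid = TinyDataGrid(node_count=4)
for i in range(100):
    grid.put(f"order:{i}", i)

total = grid.map_reduce(lambda v: v * 2, sum)
print(grid.get("order:7"), total)
```

Sending the computation to each partition, rather than pulling all values to one place, is what lets a real grid scale with the number of machines.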

References: multiple sources from the internet.

As part of the strategic trend, I would also have liked to include Data Fabric and Data Mesh. However, since in my opinion these technologies fit more meaningfully under the decentralized trend, I have included them there.

For other articles refer to luxananda.medium.com.

All views expressed here are my own and do not represent the views of the firm I work for.