Domain-Driven Data Fabric
Here is yet another buzzword: Data Fabric. Albert Einstein showed that all objects in the universe sit in a smooth fabric called space-time. These objects create curvature, just as a ball on a stretched cloth does: heavier objects create larger curvature, lighter objects create smaller curvature, and the lighter objects orbit the heavier ones along that curvature. The notion of a data fabric carries a similar idea, but unlike space-time, there is no standard model of a data fabric, or even a shared understanding of what it means to have one. The notion of data fabric cropped up because of the data lake.
A data lake is a conglomeration of different types of data in a single, global repository. The main problem with the data lake model is how to interpret the data in the repository: interpretation is a free-for-all. Every data model also comes with operations such as CRUD (Create, Read, Update, and Delete) operations. These operations, together with the underlying data, create both structural and semantic models. These structural and semantic models provide certain guaranteed properties such as consistency, integrity, atomicity, etc. Data lake models leave many of these properties to the application using the data lake. Data lakes became popular with the NoSQL world. There are many variations on the data lake model. For instance, Delta Lake provides data consistency when data in the data lake are updated, much as data warehouses handle changes to dimension data. These changes can be type 1 (overwrite in place), type 2 (keep a versioned history), and other types of slowly changing dimension updates.
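To make the type 1 versus type 2 distinction concrete, here is a minimal Python sketch of the two update styles on a customer record. The record fields and function names are illustrative assumptions, not any particular product's API.

```python
from datetime import date

def type1_update(row, new_city):
    """Type 1: overwrite the value in place; history is lost."""
    row["city"] = new_city
    return row

def type2_update(history, new_city, today):
    """Type 2: close out the current version and append a new one,
    so every historical value remains queryable."""
    current = history[-1]
    current["valid_to"] = today          # close the old version
    history.append({
        "customer_id": current["customer_id"],
        "city": new_city,
        "valid_from": today,
        "valid_to": None,                # open-ended current version
    })
    return history

# One customer record, then a type 2 change of city.
history = [{"customer_id": 42, "city": "Oslo",
            "valid_from": date(2020, 1, 1), "valid_to": None}]
type2_update(history, "Bergen", date(2024, 6, 1))
# history now holds two versions: the closed Oslo row and the open Bergen row
```

A data lake by itself guarantees neither behavior; systems like Delta Lake add the transactional machinery that makes such updates consistent.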
The SQL world is very popular for transaction processing and data warehouse models. The underlying data repository is a relational database. Relational databases provide both structural and semantic models. The structure of a relational database is a set of tables that relate to each other via primary and foreign keys, creating an entity-relationship graph model (often drawn as entity-relationship or ER diagrams). The underlying semantics of relational databases is based on relational logic.
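The entity-relation structure and its guaranteed properties can be seen in a few lines of code. The sketch below uses SQLite only because it ships with Python; the schema (customers and accounts) is an invented example, and any relational database enforces the same structure.

```python
import sqlite3

# Two tables related via a primary/foreign key form an edge in the
# entity-relationship graph: account -> customer.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE account (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    balance     REAL NOT NULL)""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO account VALUES (10, 1, 250.0)")

# Traversing the edge is a join.
row = conn.execute("""SELECT c.name, a.balance
                      FROM account a
                      JOIN customer c ON a.customer_id = c.id""").fetchone()
print(row)  # ('Alice', 250.0)

# Integrity is enforced structurally: an account pointing at a
# nonexistent customer is rejected by the database, not the application.
try:
    conn.execute("INSERT INTO account VALUES (11, 99, 0.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

This is exactly the guarantee a raw data lake lacks: in a lake, every application must re-implement such integrity checks itself.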
A data fabric is a data model and a repository that has richer structure and semantics than data lakes (including Delta Lake) and satisfies a set of invariant properties and other data properties. With this definition, the SQL world with relational databases is a data fabric, whereas a data lake is not.
Let us dive a bit deeper into data fabric models. Here are a few definitions of data fabric.
NetApp definition: A data fabric is an architecture and set of data services that provide consistent capabilities across a choice of endpoints spanning hybrid multicloud environments.
Gartner definition: A data fabric is a design concept that serves as an integrated layer (fabric) of data and connecting processes.
IBM definition: A data fabric is a data management architecture that can optimize access to distributed data and intelligently curate and orchestrate it for self-service delivery to data consumers.
Tibco definition: Data fabric is an end-to-end data integration and management solution, consisting of architecture, data management and integration software, and shared data that helps organizations manage their data. A data fabric provides a unified, consistent user experience and access to data for any member of an organization worldwide and in real-time.
Google Cloud Platform (GCP) provides Dataplex as an intelligent data fabric: a way to centrally manage, monitor, and govern your data across data lakes, data warehouses, and data marts, and make this data securely accessible to a variety of analytics and data science tools.
Looking at the above definitions and capabilities, data fabric is yet another buzzword beyond data lake (which is itself a buzzword). The relational database and the underlying relational model were built on solid structural and semantic foundations. We need to do something similar for data fabrics, with a foundation that can handle different types of data (structured, unstructured, semi-structured, graph, stream, etc.). This is not going to be easy.
Let us deviate a bit into the space-time fabric and quantum physics. There are constants such as Planck’s constant, Einstein’s cosmological constant, Boltzmann’s constant, the speed of light, etc. These constants are universal. When cosmological objects are modeled, some of these constants cannot be changed: they are invariant in any mathematical model that is derived or created. Can we define such invariants and constants when defining data fabric models?
Let us imagine a data fabric as a medium in which objects exist and perform actions on the fabric. We can imagine these objects as (Docker) containers that interact with the data fabric through actions. Let us keep things simple and assume the containers perform CRUD actions on the data fabric (in practice we would use more comprehensive APIs for these interactions). Let us also assume that the underlying structure of the data fabric is a graph with vertices and edges. The semantics of the graph structure can then be defined using graph rewrite rules with concrete semantics.
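A minimal sketch of this picture follows: a graph-structured fabric exposing CRUD actions, with one invariant (edges never dangle) enforced by the fabric itself rather than by its callers. The class and method names are illustrative assumptions, not a standard API.

```python
class GraphFabric:
    """A toy data fabric: a graph of vertices and edges on which
    containers perform CRUD actions."""

    def __init__(self):
        self.vertices = {}   # vertex id -> property dict
        self.edges = set()   # (source, label, destination) triples

    # C: create
    def create_vertex(self, vid, **props):
        if vid in self.vertices:
            raise KeyError(f"vertex {vid} already exists")
        self.vertices[vid] = dict(props)

    def create_edge(self, src, label, dst):
        # Invariant: both endpoints must exist before an edge does.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must exist")
        self.edges.add((src, label, dst))

    # R: read
    def neighbors(self, vid, label=None):
        return [d for (s, l, d) in self.edges
                if s == vid and (label is None or l == label)]

    # U: update
    def update_vertex(self, vid, **props):
        self.vertices[vid].update(props)

    # D: delete -- removing a vertex also removes incident edges,
    # preserving the no-dangling-edges invariant.
    def delete_vertex(self, vid):
        del self.vertices[vid]
        self.edges = {(s, l, d) for (s, l, d) in self.edges
                      if s != vid and d != vid}

fabric = GraphFabric()
fabric.create_vertex("acct:1", owner="Alice", balance=100)
fabric.create_vertex("acct:2", owner="Bob", balance=50)
fabric.create_edge("acct:1", "transfers_to", "acct:2")
print(fabric.neighbors("acct:1"))   # ['acct:2']
```

The point of the sketch is the placement of the invariant: it lives in the fabric, so every container gets it for free, which is the property data lakes leave to applications.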
A data fabric is not a “one size fits all” solution. For instance, a data fabric designed for financial services applications will not work for healthcare applications. We will use the term Domain-Driven Data Fabric (DDDF), where we follow the approach of Domain-Driven Design to design the data fabric. We propose the following steps for designing a DDDF:
1. Define the ubiquitous language for the domain. This language will define the semantics and vocabulary for the rest of the steps. It is also recommended to run a design thinking workshop (e.g., Stanford Design Thinking) with key stakeholders to build empathy between the designers and users of the DDDF.
2. Define the key metrics that the DDDF should track, using the OKR (Objectives and Key Results) approach. These OKRs should provide the baseline invariants and metrics that the DDDF must support. For instance, if we are dealing with financial services, we should define OKRs that matter for financial services applications.
3. Define and identify the representations of the data. For instance, we could use a knowledge graph for data representation. The knowledge graph should be flexible enough to represent different types of data (structured, unstructured, real-time, etc.).
4. Define the capabilities of the DDDF with well-defined interfaces. The knowledge graph and the capabilities together define the semantics of the data fabric. Keep in mind that the knowledge graph is derived from the domain-specific ubiquitous language and the OKRs.
5. Identify and create the technologies needed to support the knowledge graph and the capabilities. For instance, if we are developing a data fabric for transaction processing applications, then we should choose relational (SQL) databases (e.g., Db2, PostgreSQL, etc.) for storing the knowledge graph and other metadata.
6. There are many other design choices we must make depending on the OKRs. For instance, we might choose data replication (caching) when availability is important; if the application is bank transactions, we may instead choose a centralized database when performance and consistency are important.
7. Finally, the DDDF should run on many different clouds (also known as hyperscalers). That means we should consider the capabilities of the cloud providers when designing the DDDF.
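Steps 3 through 5 can be sketched in a few lines: represent the domain knowledge graph as subject-predicate-object triples, store them in a relational database, and expose a capability through a well-defined interface. The vocabulary below (customers, accounts, documents, streams) is an invented financial services example, and SQLite stands in for whichever relational store step 5 selects.

```python
import sqlite3

# Step 3: a knowledge graph as (subject, predicate, object) triples,
# flexible enough to link structured, unstructured, and streaming data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE triple (
    subject   TEXT NOT NULL,
    predicate TEXT NOT NULL,
    object    TEXT NOT NULL,
    PRIMARY KEY (subject, predicate, object))""")

triples = [
    ("Customer:42", "holds",       "Account:7"),    # structured data
    ("Account:7",   "describedBy", "doc:kyc_42"),   # unstructured document
    ("Account:7",   "emits",       "stream:txn_7"), # real-time stream
]
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", triples)

# Step 4: a capability with a well-defined interface -- return
# everything the fabric knows about an entity.
def describe(entity):
    return conn.execute(
        """SELECT predicate, object FROM triple
           WHERE subject = ? ORDER BY predicate""",
        (entity,)).fetchall()

print(describe("Account:7"))
# [('describedBy', 'doc:kyc_42'), ('emits', 'stream:txn_7')]
```

In a production DDDF the triples and capabilities would be derived from the ubiquitous language and OKRs of steps 1 and 2; the sketch only shows how the pieces fit together.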
Designing a DDDF is not a simple undertaking. Be wary of companies that promise a “one size fits all” data fabric for different kinds of applications. Such fabrics do not work in general.