The term data mesh was coined by Zhamak Dehghani in 2019 and is based on four fundamental principles that bundle well-known concepts. Before we dive deeper into these four principles, let us take a closer look at the general definition of a data mesh.
Data mesh is a sociotechnical approach to building a decentralized data architecture by leveraging a domain-oriented, self-serve design (from a software development perspective). It borrows from Eric Evans' theory of domain-driven design and from Manuel Pais and Matthew Skelton's theory of team topologies. Let's explain these terms so we can move forward with more confidence!
Domain-driven design (DDD) is a way to develop software that reflects real-world business concepts and processes in the code, requiring close collaboration between technical and business experts. Pais and Skelton's theory of team topologies complements DDD by organizing teams based on their roles and interactions (stream-aligned, enabling, complicated-subsystem, and platform teams), so that each team is well-suited to manage a specific business domain. The data mesh mainly concerns itself with the data, treating the data lake and the pipelines as secondary concerns. Its main proposition is scaling analytical data through domain-oriented decentralization. With data mesh, the responsibility for analytical data shifts from the central data team to the domain teams, supported by a data platform team that provides a domain-agnostic data platform.
This approach keeps data organized and easy to share through a common set of rules and standards rather than a central team. Imagine a big company where the sales, marketing, and customer service teams all need customer information. Because they all follow the same guidelines, they can easily share and understand each other's data, which makes working together much simpler and more efficient.
The four main building blocks of a complete data mesh are:
- Domain ownership: Teams own not only their operational data but also their analytical data. Each domain team is responsible for its own data, rather than a central data team managing all of the organization's data.
- Data as a product: Domain teams share their data as data products, much like APIs, with a product-thinking mindset. From this point of view, data is an asset that is managed throughout its lifecycle (a minimal sketch of such a data product descriptor follows this list).
- Self-serve data infrastructure platform: Data products run on a self-service platform managed by platform teams, which makes it easy for the data product teams to build and share their data products.
- Federated governance: Once data products are shared on the platform, they must follow common governance rules. These rules are defined jointly by the domain teams and the platform team so that products stay interoperable and compliant across the mesh.
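To make these principles more tangible, here is a minimal sketch of what a data product descriptor might look like in code. The field names (owner_team, output_port, quality_checks, governance_tags) are illustrative assumptions, not part of any standard; real platforms model this metadata differently.

```python
from dataclasses import dataclass, field


@dataclass
class DataProductDescriptor:
    """Illustrative metadata for a domain-owned data product (hypothetical fields)."""
    name: str                # e.g. "sales.customer_orders"
    owner_team: str          # domain team accountable for the product (domain ownership)
    description: str         # what the product offers to consumers (data as a product)
    output_port: str         # where consumers read it, e.g. a table or API endpoint
    schema: dict             # column name -> type, the published interface
    quality_checks: list = field(default_factory=list)   # checks enforced before publishing
    governance_tags: list = field(default_factory=list)  # e.g. "PII" (federated governance)


orders_product = DataProductDescriptor(
    name="sales.customer_orders",
    owner_team="sales-domain-team",
    description="Daily snapshot of confirmed customer orders",
    output_port="warehouse.sales.customer_orders_v1",
    schema={"order_id": "string", "customer_id": "string",
            "order_total": "decimal", "order_date": "date"},
    quality_checks=["order_id is unique", "order_total >= 0"],
    governance_tags=["contains-PII"],
)
```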
Real-World Use Cases & Challenges
Having introduced the main ideas and definitions of the data mesh and the concepts behind an effective data mesh implementation, let's delve into a real-world use case where the data-as-a-product mindset is missing. We will see how different teams struggle to create effective data pipelines and generate actionable insights, lowering the organization's ability to leverage its data assets for strategic decision-making and innovation. We will then walk through how an organization can adopt a data-as-a-product mindset and transform its approach to data management, fostering a culture of collaboration, accountability, and continuous improvement to unlock the full potential of its data resources.
Root cause #1: Visibility
Data producers and data consumers are not aware of each other. The data producers maintain transactional/operational databases that support their own applications and features, but those databases were never intended to support downstream machine learning and analytics teams. Consider a scenario where the operational team modifies or removes a column type. The producers, not knowing who downstream depends on the data, make the change in the dark. The consumers, in turn, have no visibility into which change caused the problems they are experiencing. They know they have data quality issues, they see inconsistent data breaking their pipelines, and so they send more and more requests and service tickets to the upstream team (the operational data engineers).
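One lightweight way to restore visibility is to compare the schema that consumers rely on against the live schema and surface the differences to both sides before pipelines break. The following is a minimal, hypothetical sketch; the function and schema names are illustrative and not taken from any specific tool.

```python
def detect_schema_drift(expected: dict, actual: dict) -> list:
    """Compare the schema a consumer relies on against the live table schema.

    Returns human-readable findings so producers and consumers see the same
    picture instead of discovering breakage through failed pipelines.
    """
    findings = []
    for column, expected_type in expected.items():
        if column not in actual:
            findings.append(f"column '{column}' was removed")
        elif actual[column] != expected_type:
            findings.append(f"column '{column}' changed type: {expected_type} -> {actual[column]}")
    for column in actual.keys() - expected.keys():
        findings.append(f"new column '{column}' appeared (not yet covered downstream)")
    return findings


# Example: the operational team changed 'order_total' from decimal to string.
expected_schema = {"order_id": "string", "order_total": "decimal"}
actual_schema = {"order_id": "string", "order_total": "string", "discount": "decimal"}
for finding in detect_schema_drift(expected_schema, actual_schema):
    print(finding)
```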
Root cause #2: Pipeline Evolution
There is no mature, well-implemented way to iteratively evolve pipelines over time, whether the change comes from upstream or from new business logic. In the absence of such methods, increasingly complex filtering is added, producing ever more complex SQL queries. This makes it harder to onboard new people or take over someone else's work, because the team ends up with queries that are difficult to interpret for anyone who did not write them. Consequently, individuals retreat to building their own tables instead of navigating the complicated code, leading to duplication and multiple versions of the same query.
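To illustrate the alternative to one ever-growing monolithic query, the sketch below splits the logic into small, named, individually testable steps. It uses plain Python records for simplicity; the same layering idea applies to SQL views or dbt-style models, and all names here are illustrative assumptions.

```python
# Each business rule lives in a small named function that can be reviewed,
# tested, and reused on its own instead of being buried in nested SQL filters.

def only_completed(orders):
    """Keep orders that reached a terminal 'completed' state."""
    return [o for o in orders if o["status"] == "completed"]


def exclude_internal_accounts(orders):
    """Drop test/internal accounts so metrics reflect real customers."""
    return [o for o in orders if not o["customer_id"].startswith("internal-")]


def add_net_revenue(orders):
    """Derive net revenue once, so every consumer uses the same definition."""
    return [{**o, "net_revenue": o["gross"] - o["discount"]} for o in orders]


def curated_orders(raw_orders):
    """The published pipeline: an explicit, readable composition of the rules above."""
    return add_net_revenue(exclude_internal_accounts(only_completed(raw_orders)))


raw = [
    {"order_id": "1", "customer_id": "c-1", "status": "completed", "gross": 100.0, "discount": 10.0},
    {"order_id": "2", "customer_id": "internal-qa", "status": "completed", "gross": 50.0, "discount": 0.0},
    {"order_id": "3", "customer_id": "c-2", "status": "cancelled", "gross": 80.0, "discount": 5.0},
]
print(curated_orders(raw))  # only order 1 survives, with net_revenue = 90.0
```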
Root cause #3: Lack of ownership
Building on root cause #2, the "spaghetti SQL" produces numerous OBTs (One Big Tables), where users add their columns of interest to shared tables owned by others. The result is a complex data structure whose original purpose becomes unclear and which no one owns. At the same time, the data producers (the central data engineering team) often do not feel accountable for the tables they create; when problems arise, they distance themselves, claiming it is not their responsibility. Unfortunately, crucial products like ML models and dashboards rely on these unowned tables, so when issues occur, no one takes the initiative to address them.
Solution Approach: Our Digital Highway
The downstream teams, such as machine learning or analytics teams, cannot operate at scale with the problems described above, and the data infrastructure incurs massive costs. Machine learning models must be built on high-quality data; training on incorrect data produces inaccurate predictions, undermining customer trust and the platform's credibility and potentially leading to a loss of business. Similarly, analytics teams working with poor-quality data produce unreliable dashboards that lead managers to misguided decisions. In our proposal for implementing a data product mindset, we emphasize the critical importance of domain ownership of, and accountability for, data products within organizations. In our previous blog post on the Digital Highway approach, we focused on empowering teams to take ownership of their data assets, ensuring that the highest-quality datasets used for machine learning models and customer-facing metrics are managed effectively. By establishing clear domain ownership, a data-as-a-product mindset, data contracts, and a broad suite of tests, teams can avoid the pitfalls of disjointed data pipelines, unmanaged data assets, and low-quality data, which often lead to increased costs and unreliable results.
You can read more about our Digital Highway and how this approach enables reliable and continuous delivery of data and machine learning systems here:
Domain ownership
This concept refers to organizing the data in line with the business itself. An organization is divided into areas of operation, also known as business domains, yet in many organizations a centralized data team manages all the data infrastructure and the data itself. The question is: how do we decompose the components of the data ecosystem and decentralize their ownership into business domains? The domain ownership principle mandates that domain teams take responsibility for their data, with team boundaries aligned with the system's bounded contexts. Data mesh argues that the ownership and serving of analytical data should respect these domains. To distribute responsibility and decentralize the familiar monolithic architectures, the current data architecture must be modeled around analytical data organized by domain. For example, the team that manages domain "A" and provides APIs for releasing "A" should also be responsible for providing the historical data that represents released "A"s over time, together with other "A"-related facts.
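As a rough illustration of that example, the sketch below shows a single domain team owning both the operational state of "A" and the analytical history of released "A"s; the class and method names are hypothetical.

```python
from datetime import date


class DomainAService:
    """One domain team owns both faces of its data: the operational state that
    backs the application, and the analytical history served to other teams."""

    def __init__(self):
        self._current = {}   # operational: latest state of each "A"
        self._history = []   # analytical: append-only record of released "A"s over time

    def release(self, a_id: str, attributes: dict, released_on: date):
        """Operational API: release a new or updated 'A'."""
        self._current[a_id] = attributes
        # The same team records the analytical fact, instead of leaving it
        # to a central data team that does not know the domain.
        self._history.append({"a_id": a_id, "released_on": released_on, **attributes})

    def current(self, a_id: str) -> dict:
        """Operational read: the state the application needs right now."""
        return self._current[a_id]

    def released_over_time(self) -> list:
        """Analytical read: released 'A's over time, served as a data product."""
        return list(self._history)


service = DomainAService()
service.release("a-1", {"name": "first A", "version": "1.0"}, released_on=date(2024, 1, 15))
print(service.current("a-1"))
print(service.released_over_time())
```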
Data Product
A data product is a logical unit that contains all the components needed to process and serve a domain's data for analytical or data-intensive use cases, and it makes that data available to other teams. Data products connect to source systems, which can be data warehouses or data lakes, such as Parquet files in cloud file storage. Once access to the source is established, the data product team must transform the data; these transformations include quality checks, aggregations, normalization, data cleansing, and so on. The team then serves the data so that it is available to other data product teams or for downstream use (BI tools, data analysts, data scientists). The domain team owns the data product and is responsible for its operation during its entire lifecycle. The team must act as an entity whose primary responsibility is to "sell" this data product to other teams and downstream users; to be able to "sell" it, they must continuously monitor and ensure data quality, availability, and cost.
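The sketch below illustrates this lifecycle in miniature: transform, run quality checks, and only then serve the product on its output port. It is a hedged example with made-up names (check_quality, publish, run_customer_product), not a prescribed implementation.

```python
def check_quality(rows):
    """Gate the data product: refuse to publish if basic expectations fail."""
    issues = []
    ids = [r["customer_id"] for r in rows]
    if len(ids) != len(set(ids)):
        issues.append("customer_id is not unique")
    if any(r["lifetime_value"] < 0 for r in rows):
        issues.append("negative lifetime_value found")
    return issues


def publish(rows, target):
    """Serve the product on its output port (here just a print; in practice a table or API)."""
    print(f"published {len(rows)} rows to {target}")


def run_customer_product(source_rows):
    """End-to-end lifecycle step: transform, verify quality, then serve."""
    cleaned = [r for r in source_rows if r["customer_id"] is not None]   # data cleansing
    issues = check_quality(cleaned)                                      # quality checks
    if issues:
        raise ValueError(f"refusing to publish: {issues}")               # protect consumers
    publish(cleaned, "warehouse.customer_domain.customer_profile_v1")    # serve downstream


rows = [
    {"customer_id": "c-1", "lifetime_value": 1200.0},
    {"customer_id": None, "lifetime_value": 300.0},   # dropped during cleansing
]
run_customer_product(rows)
```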
Data Contract
For a complete Digital Highway approach, we must include other important aspects related to data products and domain ownership. One of them is data contracts:
A data contract is a document that defines the structure, format, semantics, quality, and terms of use for data exchanged between a data provider and its consumers. It is implemented by a data product's output port or by other data technologies, often microservices. Over the past year, data contracts have taken the data world by storm as a novel approach for ensuring data quality at scale. More and more organizations want to build reliable data systems and implement data meshes, and data contracts are a big part of that movement. Data engineers must serve ever more downstream users, which creates a significant demand for quality data (imagine a machine learning model that is continuously retrained and replaced with each new batch of data). In such cases, data contracts enable quality assurance and help the downstream data science team.
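As a rough illustration, a data contract can be thought of as structured metadata plus checks that every new batch must pass before it reaches consumers. The sketch below expresses a contract as a plain Python dictionary with hypothetical fields; in practice, contracts are often YAML documents attached to a data product's output port.

```python
# A data contract expressed as plain data: what the provider promises and
# what consumers can rely on. The field names here are illustrative assumptions.
customer_orders_contract = {
    "owner": "sales-domain-team",
    "schema": {"order_id": "string", "order_total": "float", "order_date": "string"},
    "quality": {"order_total_min": 0.0},
    "terms_of_use": "internal analytics only, refreshed daily by 06:00 UTC",
}


def validate_batch(batch, contract):
    """Check a new batch against the contract before it reaches downstream teams."""
    violations = []
    for row in batch:
        missing = set(contract["schema"]) - set(row)
        if missing:
            violations.append(f"missing fields {sorted(missing)} in row {row}")
        elif row["order_total"] < contract["quality"]["order_total_min"]:
            violations.append(f"order_total below minimum in row {row}")
    return violations


batch = [{"order_id": "42", "order_total": -5.0, "order_date": "2024-05-01"}]
print(validate_batch(batch, customer_orders_contract))
```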
Nowadays, the concept of reliable data engineering, also known as Data Reliability Engineering (DRE), is gaining prominence. It integrates DevOps and Site Reliability Engineering (SRE) best practices to ensure continuous and reliable software delivery while maintaining high data quality, and it works well with unified data platforms (a data mesh with a self-serve data infrastructure). Together with data mesh, data as a product, and data contracts, Data Reliability Engineering is critical to ensuring that data products are reliable, well-governed, and easy to manage.
If you are interested in more details about these terms as well, you can find them in our previous blog posts:
- Introduction to Reliability & Collaboration for Data & ML Lifecycles
- Data Reliability Engineering & Unified Analytics
Summary
In summary, implementing a data mesh can significantly help organizations become data-driven in a collaborative and agile environment. It improves data quality and collaboration by decentralizing data ownership and treating data as a product. This approach makes domain teams accountable for their data assets, fostering better integration and higher data quality across the enterprise. By leveraging the principles of domain-driven design and team topologies, a data mesh addresses common pitfalls such as lack of visibility, unmanaged pipeline evolution, and unclear ownership. A well-executed data mesh ultimately transforms data management, enabling more effective and strategic use of data resources.
Are you facing challenges in implementing data mesh and adopting data architectures, tools, and best practices that might be unfamiliar to you? Managing end-to-end data pipelines with a data-as-a-product mindset can be complex and time-consuming, from assessing and implementing the necessary technologies to establishing effective DataOps workflows and practices such as Data Reliability Engineering. Collaborating with an experienced and independent partner can be highly beneficial.
Machine Learning Architects Basel (MLAB)
Machine Learning Architects Basel (MLAB) is part of the Swiss Digital Network (SDN) and has developed several DataOps solutions for our customers, showing them best practices and enabling seamless integration, enhanced data quality, and efficient workflows. Contact our team if you are interested in how MLAB can help your organization create sustainable benefits by developing and maintaining reliable data solutions.
Also, if you are interested in how to apply data contracts and data tests in practice (using state-of-the-art DataOps solutions), please follow this link for a hands-on tech guide.
References and Acknowledgements
- How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- Domain-Driven Design: Tackling Complexity in the Heart of Software
- Team Topologies: Organizing Business and Technology Teams for Fast Flow
- Data Mesh: Concepts and Principles of a Paradigm Shift in Data Architectures
- Data Mesh Architecture
- Data Mesh