Why semantics matter in the modern data stack
Most organizations are now well into re-platforming their enterprise data stacks to cloud-first architectures. The shift in data gravity to centralized cloud data platforms brings enormous potential. However, many organizations are still struggling to deliver value and demonstrate true business outcomes from their data and analytics investments.
The term “modern data stack” is commonly used to define the ecosystem of technologies surrounding cloud data platforms. To date, the concept of a semantic layer hasn’t been formalized within this stack.
When applied correctly, a semantic layer forms a new center of knowledge gravity that maintains the business context and semantic meaning necessary for users to create value from enterprise data assets. Further, it becomes a hub for leveraging active and passive metadata to optimize the analytics experience, improve productivity and manage cloud costs.
What is the semantic layer?
Wikipedia describes the semantic layer as “a business representation of data that lets users interact with data assets using business terms such as product, customer or revenue to offer a unified, consolidated view of data across the organization.”
The term was coined in an age of on-premises data stores, a time when business analytics infrastructure was costly and highly limited in functionality compared to today’s offerings. While the semantic layer’s origins lie in the days of OLAP, the concept is even more relevant today.
What is the modern data stack?
While the term “modern data stack” is frequently used, there are many representations of what it means. In my opinion, Matt Bornstein, Jennifer Li and Martin Casado from Andreessen Horowitz (A16Z) offer the cleanest view in Emerging Architectures for Modern Data Infrastructure.
I will refer to this simplified diagram based on their work below:
This representation tracks the flow of data from left to right. Raw data from various sources moves through ingestion and transport services into core data platforms that manage storage, query and processing, and transformation before being consumed by users in a variety of analysis and output modalities. In addition to storage, data platforms offer SQL query engines and access to artificial intelligence (AI) and machine learning (ML) utilities. A set of shared services, shown at the bottom of the diagram, cuts across the entire data processing flow.
Where is the semantic layer?
A semantic layer is implicit any time humans interact with data: It arises organically unless there is an intentional strategy implemented by data teams. Historically, semantic layers were implemented within analysis tools (BI platforms) or within a data warehouse. Both approaches have limitations.
BI-tool semantic layers are use case specific; multiple semantic layers tend to arise across different use cases leading to inconsistency and semantic confusion. Data warehouse-based approaches tend to be overly rigid and too complex for business users to work with directly; work groups will end up extracting data to local analytics environments — again leading to multiple disconnected semantic layers.
I use the term “universal semantic layer” to describe a thin, logical layer that sits between the data platform and the analysis and output services and abstracts the complexity of raw data assets so that users can work with business-oriented metrics and analysis frameworks within their preferred analytics tools.
The challenge is how to assemble the minimum viable set of capabilities that gives data teams sufficient control and governance while delivering end-users more benefits than they could get by extracting data into localized tools.
Implementing the semantic layer using transformation services
The set of transformation services in the A16Z data stack includes metrics layer, data modeling, workflow management and entitlements and security services. When implemented, coordinated and orchestrated properly, these services form a universal semantic layer that delivers important capabilities, including:
- Creating a single source of truth for enterprise metrics and hierarchical dimensions, accessible from any analytics tool.
- Providing the agility to easily update or define new metrics, design domain-specific views of data and incorporate new raw data assets.
- Optimizing analytics performance while monitoring and controlling cloud resource consumption.
- Enforcing governance policies around access control, definitions, performance and resource consumption.
Let’s step through each transformation service with an eye toward how they must interact to serve as an effective semantic layer.
Data modeling
Data modeling is the creation of business-oriented, logical data models that are directly mapped to the physical data structures in the warehouse or lakehouse. Data modelers or analytics engineers focus on three important modeling activities:
Making data analytics-ready: Simplifying raw, normalized data into clear, mostly de-normalized data that is easier to work with.
Definition of analysis dimensions: Implementing standardized definitions of the hierarchical dimensions used in business analysis, for example how an organization maps months to fiscal quarters to fiscal years.
Metrics design: Logical definition of key business metrics used in analytics products. Metrics can be simple definitions (how the business defines revenue or ship quantity). They can be calculations, like gross margin ([revenue-cost]/revenue). Or they can be time-relative (quarter-on-quarter change).
I like to refer to the output of semantic layer-related data modeling as a semantic model.
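As a minimal sketch of what such a semantic model might encode, the Python below defines one hierarchical dimension (calendar date to fiscal quarter and fiscal year) and the two kinds of derived metrics mentioned above. The function names, the February fiscal-year start and the convention of labeling a fiscal year by the calendar year in which it ends are all assumptions made for illustration; they do not describe any particular product.

```python
from datetime import date


def fiscal_period(day: date, fiscal_start_month: int = 2) -> tuple[int, int]:
    """Map a calendar date to (fiscal_year, fiscal_quarter).

    Assumption: a fiscal year is labeled by the calendar year in which it ends.
    """
    # Number of whole months elapsed since the fiscal year began (0-11).
    months_into_year = (day.month - fiscal_start_month) % 12
    quarter = months_into_year // 3 + 1
    starts_in_january = fiscal_start_month == 1
    if starts_in_january or day.month < fiscal_start_month:
        year = day.year
    else:
        year = day.year + 1
    return year, quarter


def gross_margin(revenue: float, cost: float) -> float:
    """Calculated metric: (revenue - cost) / revenue."""
    return (revenue - cost) / revenue


def quarter_on_quarter_change(current: float, previous: float) -> float:
    """Time-relative metric: relative change versus the prior quarter."""
    return (current - previous) / previous


if __name__ == "__main__":
    print(fiscal_period(date(2023, 1, 31)))
    print(gross_margin(200.0, 150.0))
    print(quarter_on_quarter_change(150.0, 100.0))
```

Defining the fiscal calendar once, in one place, is what lets every downstream tool agree on what "last quarter" means.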
The metrics layer
The metrics layer is the single source of metrics truth for all analytics use cases. Its primary function is maintaining a metrics store that can be accessed from the full range of analytics consumers and analytics tools (BI platforms, applications, reverse ETL, and data science tools).
The term “headless BI” describes a metrics layer service that supports user queries from a variety of BI tools. This is the fundamental capability for semantic layer success — if users are unable to interact with a semantic layer directly using their preferred analytics tools, they will end up extracting data into their tool using SQL and recreating a localized semantic layer.
Additionally, metrics layers need to support four important services:
Metrics curation: Metrics stewards will move between data modeling and the metrics layer to curate the set of metrics provided for different analytics use cases.
Metrics change management: The metrics layer serves as an abstraction layer that shields the complexity of raw data from data consumers. When a metric definition changes, existing reports and dashboards are preserved because they reference the metric rather than the underlying data.
Metrics discoverability: Data product creators need to easily find and implement the proper metrics for their purpose. This becomes more important as the list of curated metrics grows to include a broader set of calculated or time-relative metrics.
Metrics serving: Metrics layers are queried directly from analytics and output tools. As end users request metrics from a dashboard, the metrics layer needs to serve the request fast enough to provide a positive analytics user experience.
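The curation, change management and discoverability services above can be sketched as a small in-memory metrics store. This is an illustrative toy rather than a real metrics layer: `Metric`, `MetricsStore` and the column names in the expressions are hypothetical, and a production service would persist definitions and serve queries over the network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str          # stable name that dashboards reference
    expression: str    # logical definition, shown here as a SQL fragment
    description: str   # business-facing text used for discovery


class MetricsStore:
    def __init__(self) -> None:
        self._metrics: dict[str, Metric] = {}

    def publish(self, metric: Metric) -> None:
        """Curation and change management: add a metric or replace its definition."""
        self._metrics[metric.name] = metric

    def search(self, keyword: str) -> list[str]:
        """Discoverability: find metric names by keyword in the name or description."""
        kw = keyword.lower()
        return sorted(
            m.name
            for m in self._metrics.values()
            if kw in m.name.lower() or kw in m.description.lower()
        )

    def resolve(self, name: str) -> str:
        """Serving: return the current definition behind a stable metric name."""
        return self._metrics[name].expression


if __name__ == "__main__":
    store = MetricsStore()
    store.publish(Metric("revenue", "SUM(order_amount)", "Total booked order value"))
    # The definition changes; consumers still ask for "revenue".
    store.publish(Metric("revenue", "SUM(order_amount - refund_amount)",
                         "Total booked order value net of refunds"))
    print(store.resolve("revenue"))
```

Because a dashboard stores only the name `revenue`, redefining the metric does not require touching the dashboard.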
Workflow management
Transformation of raw data into an analytics-ready state can be based on physical materialized transforms, virtual views based on SQL or some combination of the two. Workflow management is the orchestration and automation of physical and logical transforms that support the semantic layer function and directly impact the cost and performance of analytics.
Performance: Analytics consumers have a very low tolerance for query latency. A semantic layer cannot introduce a query performance penalty; otherwise, clever end users will again go down the data extract route and create alternative semantic layers. Effective performance management workflows automate and orchestrate physical materializations (creation of aggregate tables) as well as decide what and when to materialize. This functionality needs to be dynamic and adaptive based on user query behavior, query runtimes and other active metadata.
Cost: The primary cost tradeoff for performance is related to cloud resource consumption. Physical transformations executed in the data platform (ELT transforms) consume compute cycles and cost money. End user queries do the same. The decisions made on what to materialize and what to virtualize directly impact cloud costs for analytics programs.
The analytics performance-cost tradeoff becomes an interesting optimization problem that needs to be managed for each data product and use case. This is the job of workflow management services.
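One simple way to frame this optimization is a greedy rule: materialize an aggregate only when the compute it saves across a day of queries exceeds what it costs to keep refreshed, and stop when a refresh budget is exhausted. The sketch below assumes costs can be expressed in a single unit ("credits") and that query frequencies are already known from usage metadata; `QueryPattern`, `choose_materializations` and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPattern:
    name: str
    runs_per_day: int
    cost_on_raw: float        # credits per query against raw tables
    cost_on_aggregate: float  # credits per query against an aggregate table
    refresh_cost: float       # credits per day to keep the aggregate fresh

    @property
    def daily_savings(self) -> float:
        per_query = self.cost_on_raw - self.cost_on_aggregate
        return self.runs_per_day * per_query - self.refresh_cost


def choose_materializations(patterns: list[QueryPattern], refresh_budget: float) -> list[str]:
    """Greedily pick the aggregates with the highest net savings that fit the budget."""
    chosen: list[str] = []
    spent = 0.0
    for p in sorted(patterns, key=lambda p: p.daily_savings, reverse=True):
        if p.daily_savings <= 0:
            break  # remaining patterns are cheaper to leave as virtual views
        if spent + p.refresh_cost <= refresh_budget:
            chosen.append(p.name)
            spent += p.refresh_cost
    return chosen


if __name__ == "__main__":
    patterns = [
        QueryPattern("daily_sales", 200, 5.0, 1.0, 100.0),
        QueryPattern("audit_detail", 3, 5.0, 1.0, 100.0),
        QueryPattern("regional_margin", 50, 4.0, 1.0, 60.0),
    ]
    print(choose_materializations(patterns, refresh_budget=120.0))
```

A greedy rule is not guaranteed to be optimal, and a real service would also re-evaluate these choices as query behavior shifts.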
Entitlements and security
Transformation-related entitlements and security services relate to the active application of data governance policies to analytics. Beyond cataloging data governance policies, the modern data stack must enforce policies at query time, as metrics are accessed by different users. Many different types of entitlements may be managed and enforced alongside (or embedded in) a semantic layer.
Access control: Proper access control services ensure all users can get access to all of the data they are entitled to see, and only that data.
Model and metrics consistency: Maintaining semantic layer integrity requires some level of centralized governance of how metrics are defined, shared and used.
Performance and resource consumption: As discussed above, there are constant tradeoffs being made on performance and resource consumption. User entitlements and use case priority may also factor into the optimization.
Real-time enforcement of governance policies is critical for maintaining semantic layer integrity.
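To make query-time enforcement concrete, here is a minimal sketch in which each request for a metric is checked against a policy and the result rows are filtered before they are returned. The roles, the `POLICIES` structure and the `enforce` function are assumptions for illustration only; real platforms typically push such filters down into the generated SQL instead of filtering rows in memory.

```python
# Hypothetical policy table: which metrics a role may query, and which rows it may see.
POLICIES = {
    "emea_analyst": {"metrics": {"revenue"}, "row_filter": {"region": "EMEA"}},
    "finance_lead": {"metrics": {"revenue", "gross_margin"}, "row_filter": {}},
}


def enforce(role: str, metric: str, rows: list[dict]) -> list[dict]:
    """Apply entitlements at query time: check metric access, then filter rows."""
    policy = POLICIES.get(role)
    if policy is None or metric not in policy["metrics"]:
        raise PermissionError(f"role {role!r} may not query metric {metric!r}")
    row_filter = policy["row_filter"]
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in row_filter.items())
    ]


if __name__ == "__main__":
    rows = [
        {"region": "EMEA", "revenue": 120.0},
        {"region": "APAC", "revenue": 80.0},
    ]
    print(enforce("emea_analyst", "revenue", rows))
    print(enforce("finance_lead", "revenue", rows))
```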
Integrating the semantic layer within the modern data stack
Layers in the modern data stack must seamlessly integrate with other surrounding layers. The semantic layer requires deep integration with its data fabric neighbors — most importantly, the query and processing services in the data platform and analysis and output tools.
Data platform integration
A universal semantic layer should not persist data outside of the data platform. A coordinated set of semantic layer services needs to integrate with the data platform in a few important ways:
Query engine orchestration: The semantic layer dynamically translates incoming queries from consumers (using the metrics layer logical constructs) to platform-specific SQL (rewritten to reflect the logical to physical mapping defined in the semantic model).
Transform orchestration: Managing performance and cost requires the capability to materialize certain views into physical tables. This means the semantic layer must be able to orchestrate transformations in the data platform.
AI/ML integration: While many data science activities leverage specialized tools and services accessing raw data assets directly, a formalized semantic layer creates the opportunity to provide business-vetted features from the metrics layer to data scientists and AI/ML pipelines.
Tight data platform integration ensures that the semantic layer stays thin and can operate without persisting data locally or in a separate cluster.
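The query engine orchestration described above amounts to rewriting a logical request (metric and dimension names) into SQL against physical tables and columns. The sketch below shows the idea under heavy simplification: a single fact table, no joins, and generic SQL rather than any platform's dialect. The table, column names and `to_sql` function are hypothetical.

```python
# Hypothetical semantic model: logical names mapped to physical columns and expressions.
SEMANTIC_MODEL = {
    "table": "analytics.fact_orders",
    "dimensions": {
        "fiscal_quarter": "fiscal_qtr_cd",
        "region": "sales_region",
    },
    "metrics": {
        "revenue": "SUM(order_amt)",
        "gross_margin": "(SUM(order_amt) - SUM(cost_amt)) / SUM(order_amt)",
    },
}


def to_sql(metrics: list[str], dimensions: list[str], model: dict = SEMANTIC_MODEL) -> str:
    """Translate a logical query into SQL using the logical-to-physical mapping.

    Unknown metric or dimension names raise KeyError.
    """
    dim_columns = [model["dimensions"][d] for d in dimensions]
    select_items = [f"{col} AS {d}" for col, d in zip(dim_columns, dimensions)]
    select_items += [f"{model['metrics'][m]} AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(select_items)} FROM {model['table']}"
    if dim_columns:
        sql += " GROUP BY " + ", ".join(dim_columns)
    return sql


if __name__ == "__main__":
    print(to_sql(["revenue"], ["region"]))
```

Because the query runs where the data already lives, nothing needs to be copied out of the data platform to answer it.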
Analysis and output
A successful semantic layer, including a headless BI approach to implementing the metrics layer, must be able to support a variety of inbound query languages and interfaces, including SQL (Tableau), MDX (Microsoft Excel), DAX (Microsoft Power BI), Python (data science tools) and RESTful interfaces (for application developers), over standard protocols such as ODBC, JDBC, HTTP(S) and XMLA.
Leading organizations incorporate data science and enterprise AI into everyday decision-making in the form of augmented analytics. A semantic layer can be helpful in successfully implementing augmented analytics. For example:
- Semantic layers can support natural language query initiatives. “Alexa, what was our sales revenue last quarter?” will only return the right results if Alexa has a clear understanding of what revenue and time mean.
- Semantic layers can be used to publish AI/ML-generated insights (predictions and forecasts) to business users using the same analytics tools they use to analyze historical data.
- Beyond just prediction values, semantic layers can make broader inference data available to business users in a way that can enhance explainability and trust in enterprise AI.
The center of mass for knowledge gravity in the modern data stack
The A16Z model implies that organizations could assemble a fabric of home-grown or single-purpose vendor offerings to build a semantic layer. While certainly possible, success will be determined by how well integrated the individual services are. As noted, if even a single service or integration fails to deliver on user needs, localized semantic layers are inevitable.
Furthermore, it is important to consider how vital business knowledge gets sprinkled across data fabrics in the form of metadata. The semantic layer has the advantage of seeing a large portion of active and passive metadata created for analytics use cases. This creates an opportunity for forward-thinking organizations to better manage this knowledge gravity and better leverage metadata for improving the analytics experience and driving incremental business value.
While the semantic layer is still emerging as a technology category, it will clearly play an important role in the evolution of the modern data stack.
This article is a summary of my current research around semantic layers within the modern, cloud-first data stack. I’ll be presenting my full findings at the upcoming virtual Semantic Layer Summit on April 26, 2023.
David P. Mariani is CTO and cofounder of AtScale, Inc.