Every company today is data-driven or at least claims to be. Business decisions are no longer made based on hunches or anecdotal trends as they were in the past. Concrete data and analytics now power businesses’ most critical decisions.
As more companies leverage the power of machine learning and artificial intelligence to make critical choices, there must be a conversation around the quality (the completeness, consistency, validity, timeliness and uniqueness) of the data used by these tools. The insights companies expect to be delivered by machine learning (ML) or AI-based technologies are only as good as the data used to power them. The old adage “garbage in, garbage out” applies directly to data-based decisions.
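As a rough illustration of how some of the quality dimensions named above can be measured, the sketch below scores one field for completeness, uniqueness and validity. The sample records, the field name and the email pattern are all assumptions made for this example; consistency and timeliness are left out because they need cross-source or timestamp context.

```python
import re

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "not-an-email"},
    {"id": 4, "email": "ana@example.com"},
]

# Deliberately loose pattern: something@something.something, no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def quality_metrics(rows, field):
    """Return completeness, uniqueness and validity ratios for one field."""
    values = [row.get(field) for row in rows]
    filled = [v for v in values if v]  # drop empty strings and None
    total = len(values)
    if not filled:
        return {"completeness": 0.0, "uniqueness": 0.0, "validity": 0.0}
    valid = [v for v in filled if EMAIL_PATTERN.match(v)]
    return {
        "completeness": len(filled) / total,        # share of rows with a value
        "uniqueness": len(set(filled)) / len(filled),  # share of distinct values
        "validity": len(valid) / len(filled),       # share matching the pattern
    }


metrics = quality_metrics(records, "email")
print(metrics)
```

Ratios like these are only a starting point; what counts as an acceptable score depends on how the data will be used.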
Poor data quality leads to increased complexity of data ecosystems and poor decision-making over the long term. In fact, Gartner estimates that poor data quality costs organizations an average of $12.9 million every year. As data volumes continue to increase, so will the challenges that businesses face with validating and remediating their data. To overcome issues related to data quality and accuracy, it’s critical to first know the context in which the data elements will be used, as well as best practices to guide the initiatives along.
1. Data quality is not a one-size-fits-all endeavor
Data initiatives are not specific to a single business driver. In other words, determining data quality will always depend on what a business is trying to achieve with that data. The same data can impact more than one business unit, function or project in very different ways. Furthermore, the list of data elements that require strict governance may vary according to different data users. For example, marketing teams are going to need a highly accurate and validated email list while R&D would be invested in quality user feedback data.
The best team to discern a data element’s quality, then, would be the one closest to the data. Only they will be able to recognize data as it supports business processes and ultimately assess accuracy based on what the data is used for and how.
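To make the point about context concrete, here is a minimal sketch in which the same records are scored against two different team-specific checks. The records, the team names and both checks are hypothetical, chosen only to mirror the marketing and R&D example above.

```python
# Hypothetical shared records used by two different teams.
records = [
    {"email": "ana@example.com", "feedback": "Love the new dashboard"},
    {"email": "bad-address", "feedback": "Export is slow on large files"},
    {"email": "li@example.com", "feedback": ""},
    {"email": "sam@example.com", "feedback": "ok"},
]


def marketing_ok(row):
    """Marketing cares about a plausibly deliverable email address."""
    email = row["email"]
    return "@" in email and "." in email.split("@")[-1]


def research_ok(row):
    """R&D cares about feedback with enough substance to analyze."""
    return len(row["feedback"].split()) >= 3


# Each team supplies its own definition of "fit for use".
checks = {"marketing": marketing_ok, "r_and_d": research_ok}


def fitness_by_team(rows, team_checks):
    """Share of rows that pass each team's check."""
    return {
        team: sum(1 for row in rows if check(row)) / len(rows)
        for team, check in team_checks.items()
    }


scores = fitness_by_team(records, checks)
print(scores)
```

The same four rows come out 75% usable for one team and 50% usable for the other, which is why the people closest to the data are best placed to define the checks.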
2. What you don’t know can hurt you
Data is an enterprise asset. However, actions speak louder than words, and not everyone within an enterprise is doing all they can to make sure data is accurate. If users do not recognize the importance of data quality and governance, or simply don’t prioritize them as they should, they are not going to make an effort to anticipate data issues from mediocre data entry or raise their hand when they find a data issue that needs to be remediated.
One practical way to address this is to track data quality metrics as a performance goal, which fosters more accountability among those directly involved with data. In addition, business leaders must champion the importance of their data quality program. They should align with key team members about the practical impact of poor data quality: for instance, misleading insights shared in inaccurate reports for stakeholders, which can potentially lead to fines or penalties. Investing in better data literacy can help organizations create a culture of data quality and avoid careless or ill-informed mistakes that damage the bottom line.
3. Don’t try to boil the ocean
It is not practical to fix a long laundry list of data quality problems all at once, nor is it an efficient use of resources. The number of data elements active within any given organization is huge and is growing exponentially. It’s best to start by defining an organization’s Critical Data Elements (CDEs), which are the data elements integral to the main function of a specific business. CDEs are unique to each business, although some recur: Net Revenue is a common CDE for most businesses because it’s important for reporting to investors and other shareholders.
Since every company has different business goals, operating models and organizational structures, every company’s CDEs will be different. In retail, for example, CDEs might relate to design or sales. On the other hand, healthcare companies will be more interested in ensuring the quality of regulatory compliance data. Although this is not an exhaustive list, business leaders might consider asking the following questions to help define their unique CDEs: What are your critical business processes? What data is used within those processes? Are these data elements involved in regulatory reporting? Will these reports be audited? Will these data elements guide initiatives in other departments within the organization?
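The screening questions above can be turned into a simple checklist score. The sketch below is one possible approach, not a standard method: the element names, the yes/no answers and the threshold of two "yes" answers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataElement:
    """One candidate element with yes/no answers to the screening questions."""
    name: str
    in_critical_process: bool
    in_regulatory_report: bool
    audited: bool
    used_by_other_departments: bool

    def score(self) -> int:
        # Count how many screening questions were answered "yes".
        return sum([
            self.in_critical_process,
            self.in_regulatory_report,
            self.audited,
            self.used_by_other_departments,
        ])


def pick_cdes(elements, threshold=2):
    """Return names of elements meeting the (arbitrary) threshold, highest score first."""
    ranked = sorted(elements, key=lambda e: e.score(), reverse=True)
    return [e.name for e in ranked if e.score() >= threshold]


# Hypothetical candidates and answers.
elements = [
    DataElement("customer_email", True, False, False, True),
    DataElement("office_plant_count", False, False, False, False),
    DataElement("net_revenue", True, True, True, True),
]

cdes = pick_cdes(elements)
print(cdes)
```

A real program would likely weight the questions differently, since an element that feeds audited regulatory reports may matter more than one that is merely shared across departments.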
Validating and remediating only the most critical elements will help organizations scale their data quality efforts in a sustainable and resourceful way. Eventually, an organization’s data quality program will reach a level of maturity where there are frameworks (often with some level of automation) that will categorize data assets based on predefined elements to remove disparity across the enterprise.
4. More visibility = more accountability = better data quality
Businesses drive value by knowing where their CDEs are, who is accessing them and how they’re being used. In essence, there is no way for a company to identify its CDEs if it doesn’t have proper data governance in place at the start. However, many companies struggle with unclear or non-existent ownership of their data stores. Defining ownership before onboarding more data stores or sources promotes commitment to quality and usefulness. It’s also wise for organizations to set up a data governance program where data ownership is clearly defined and people can be held accountable. This can be as simple as a shared spreadsheet recording who owns each data element, or it can be managed by a sophisticated data governance platform.
Just as organizations should model their business processes to improve accountability, they must also model their data, in terms of data structure, data pipelines and how data is transformed. Data architecture attempts to model the structure of an organization’s logical and physical data assets and data management resources. Creating this type of visibility gets at the heart of the data quality issue: without visibility into the *lifecycle* of data (when it’s created, how it’s used or transformed and how it’s output), it’s impossible to ensure true data quality.
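The shared-spreadsheet approach to ownership can be sketched in a few lines. The example below assumes a hypothetical CSV export with `element`, `owner` and `steward` columns and flags any element that has no owner recorded; the column names and contents are invented for illustration.

```python
import csv
import io

# Hypothetical export of a shared ownership spreadsheet.
OWNERSHIP_CSV = """element,owner,steward
net_revenue,finance,j.doe
customer_email,marketing,
product_feedback,,
"""


def load_registry(text):
    """Parse the CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def unowned(rows):
    """Return the data elements that have no owner recorded."""
    return [row["element"] for row in rows if not row["owner"].strip()]


registry = load_registry(OWNERSHIP_CSV)
missing = unowned(registry)
print(missing)
```

Running a check like this before onboarding a new data source gives a quick list of elements nobody is yet accountable for.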
5. Data overload
Even when data and analytics teams have established frameworks to categorize and prioritize CDEs, they are still left with thousands of data elements that need to be either validated or remediated. Each of these data elements can require one or more business rules that are specific to the context in which it will be used. However, those rules can only be assigned by the business users working with those unique data sets. Data quality teams therefore need to work closely with subject matter experts to identify rules for each unique data element, which is a heavy workload even when the elements are prioritized. This often leads to burnout and overload within data quality teams because they are responsible for manually writing a large number of rules for a variety of data elements. Organizations must set realistic expectations for the workload of their data quality team members. They may consider expanding their data quality team and/or investing in tools that leverage ML to reduce the amount of manual work in data quality tasks.
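To give a sense of what these per-element business rules look like in practice, here is a minimal sketch of a rule registry applied to a single record. The element names and the rules themselves are hypothetical; in a real program they would come from the subject matter experts described above.

```python
# Hypothetical rule registry: each data element maps to (label, check) pairs.
RULES = {
    "net_revenue": [
        ("is non-negative", lambda v: v >= 0),
    ],
    "customer_email": [
        ("contains @", lambda v: "@" in v),
        ("has no spaces", lambda v: " " not in v),
    ],
}


def violations(record, rules):
    """Return (element, rule label) pairs for every rule the record fails."""
    failed = []
    for element, checks in rules.items():
        if element not in record:
            # A missing element is reported once instead of running its rules.
            failed.append((element, "is present"))
            continue
        for label, check in checks:
            if not check(record[element]):
                failed.append((element, label))
    return failed


# Hypothetical record with deliberately bad values.
record = {"net_revenue": -50.0, "customer_email": "ana example.com"}
problems = violations(record, RULES)
print(problems)
```

Even this toy registry needs three rules for two elements, which hints at how quickly manual rule writing grows across thousands of elements.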
Data isn’t just the new oil of the world: it’s the new water of the world. Organizations can have the most intricate infrastructure, but if the water (or data) running through those pipelines isn’t drinkable, it’s useless. People who need this water must have easy access to it; they must know that it’s usable and not tainted; they must know when supply is low; and, lastly, the suppliers and gatekeepers must know who is accessing it. Just as access to clean drinking water helps communities in a variety of ways, improved access to data, mature data quality frameworks and a deeper data quality culture can protect data-reliant programs and insights, helping spur innovation and efficiency within organizations around the world.
JP Romero is Technical Manager at Kalypso