Data Quality
Data quality measures the value that data provides. Data is high quality when it meets the business needs and expectations of its consumers.
Two concepts help determine data quality: data quality dimensions and business rules. A data quality dimension is “a measurable feature or characteristic of data” and provides a vocabulary for quantifying data quality. Business rules “describe expectations about the quality characteristics of data” and therefore form the foundation of defining data quality. There are many different sets of data quality dimensions, with significant overlap between them, but some of the more common dimensions are:
- Accuracy – degree to which data correctly represents real world values or entities
- Completeness – refers to whether required data is present
- Consistency / Integrity – refers to whether records and their attributes are consistent across systems and time
- Reasonability – the notion that data meets the assumptions and expectations of its domain
- Timeliness – whether data is up-to-date and/or available when it is needed
- Uniqueness – degree to which data is free of unintended duplicate values or records
- Validity / Conformity – whether data conforms to the defined domain of values in type, format, and precision
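The dimensions above only become useful when they are scored against actual data. A minimal sketch, assuming a hypothetical set of records with `id` and `state` fields, of turning two dimensions (completeness and uniqueness) into simple metrics:

```python
from collections import Counter

# Hypothetical records; field names are illustrative, not from any real system.
records = [
    {"id": 1, "state": "NY"},
    {"id": 2, "state": None},
    {"id": 3, "state": "CA"},
    {"id": 3, "state": "CA"},  # duplicate id
]

def completeness(records, field):
    """Fraction of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def uniqueness(records, key):
    """Fraction of records whose key value occurs exactly once."""
    counts = Counter(r[key] for r in records)
    unique = sum(1 for r in records if counts[r[key]] == 1)
    return unique / len(records)

print(completeness(records, "state"))  # 3 of 4 states filled -> 0.75
print(uniqueness(records, "id"))       # only ids 1 and 2 occur once -> 0.5
```

Scoring each dimension as a fraction like this makes it easy to track quality over time and to set thresholds that trigger remediation.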
An example business rule is: a state code is required and must be in the ISO 2-character format. This rule involves the completeness and validity dimensions. Required means the state code must meet a certain degree of completeness; in terms of validity, the state code must also conform to a specific format. Rules and dimensions are used as the basis for defining data quality metrics, and metrics in turn are used to estimate the cost of poor-quality data.
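The state-code rule above can be sketched as a simple validation function. This is a minimal illustration, not a production check: the regex only enforces the 2-uppercase-letter shape, standing in for the full ISO code list, and the function name is hypothetical.

```python
import re

# Stand-in for "ISO 2-character format": exactly two uppercase letters.
STATE_CODE = re.compile(r"^[A-Z]{2}$")

def check_state_code(value):
    """Return a list of rule violations for a single state-code value."""
    violations = []
    if value is None or value == "":
        # Completeness: the field is required.
        violations.append("completeness: state code is required")
    elif not STATE_CODE.fullmatch(value):
        # Validity: the value must conform to the defined format.
        violations.append("validity: state code must be 2 uppercase letters")
    return violations

print(check_state_code("NY"))        # [] -- passes both checks
print(check_state_code(None))        # completeness violation
print(check_state_code("New York"))  # validity violation
```

Running a check like this over every record, and counting violations per dimension, is one way the rule feeds directly into data quality metrics.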
Why do Data Quality?
Gartner, Inc. has estimated that poor data quality costs the average enterprise about $15 million annually. In 2016, IBM estimated that poor data quality costs the US economy a staggering $3.1 trillion a year. The Harvard Business Review featured an analysis finding that knowledge workers spend as much as 50% of their time dealing with data quality problems, and the New York Times reported that data scientists can spend 50% to 80% of their working time cleaning data.