the data are inconsistent.
Validity: Whether data conform to the specified format. For example, the Purchase Date field contains many different date formats; which is the valid format?
Timeliness: Whether data are up to date and are available to users in a timely manner. For example, the entry for brick 045 could have been added two months after the purchase date, which is slower than the required update frequency. Additionally, if bricks are being purchased daily, then an absence of new data could indicate that the data update process has failed.
Uniqueness: Whether a single representation exists for each physical entity. For example, in the table, no ID appears twice, so it is likely that all entries for these bricks are unique.
This example analysis is the starting point for data quality, but further work would need to be done to provide a complete technical approach to ensure data are fit for purpose. This involves generating an explicit data specification to capture all the identified requirements and a set of tests to ensure the data meet these requirements. These tests vary from simple (e.g. comparing the content of a data set to the required syntax formally defined in the data specification) to complex (e.g. identifying whether, for all current customers, contact details exist and are correct in the customer relationship management database).
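The simple end of this spectrum can be sketched in a few lines of code. The example below is an illustrative sketch only: the brick-register records, field names and required date format are assumptions for the purpose of the example, not taken from any real specification.

```python
import re

# Hypothetical brick register; field names and values are illustrative.
records = [
    {"id": "044", "purchase_date": "2023-04-12"},
    {"id": "045", "purchase_date": "12/04/2023"},  # breaks the assumed format
    {"id": "046", "purchase_date": "2023-04-13"},
]

# Validity test: the (assumed) specification requires dates as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")
invalid_dates = [r["id"] for r in records
                 if not DATE_PATTERN.match(r["purchase_date"])]

# Uniqueness test: every ID should appear exactly once.
ids = [r["id"] for r in records]
duplicate_ids = sorted({i for i in ids if ids.count(i) > 1})

print(invalid_dates)   # IDs whose dates break the specified format
print(duplicate_ids)   # IDs that appear more than once
```

Even a sketch like this shows the principle: each test compares the data set against an explicit rule taken from the data specification, rather than against an informal notion of 'good' data.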
In summary, data quality dimensions prompt the analysis of data requirements. These dimensions are, however, ultimately superseded by the content of the resulting data specification, which becomes the formal basis on which to test the quality of each relevant data set.
Given the technical complexities that underpin data quality, organisations face the challenge of ensuring a consistent, effective and efficient approach to data management across all relevant stakeholders. Meeting this challenge is the role of data quality management.
What is data quality management?
The subject of this book is data quality management, so it is important that the meaning of this term is clear. ISO 8000-2 defines data quality management as:

coordinated activities to direct and control an organization with regard to data quality.

Whilst definitions in ISO standards can sometimes require a little effort to understand, this definition is relatively clear. In essence, it describes an overall approach consisting of different activities to monitor, manage and control data quality, with suitable oversight to direct and control these activities.
Data quality management is more than just managing data quality; it involves considering why data are incorrect in the first place. For example, if you undertake a data cleansing exercise without also addressing the underlying root causes of the data errors, the cleansing is highly likely to have to be repeated on a regular basis.
Data quality management is also not about trying to achieve an idealistic, ‘perfect’ data set. As mentioned earlier, the cost, time and effort needed to achieve perfection will not be attractive to any organisation, and perfection would probably be impossible to achieve in any case. Data quality management is, therefore, about balancing current data quality against the required quality and the benefits that the necessary improvements can deliver.
Summary
Data are a key element of any enterprise.
By treating data as an asset, the enterprise focuses on delivering value from data.
Data quality is conformance to requirements rather than abstract perfection.
The next chapter explores the challenge of managing the requirements to establish the foundation for conformance.
2 Challenges when exploiting and managing data
Managing data quality is not an easy or simple task, and there are various factors that determine the purpose and scope of data quality management in an enterprise context. This chapter explores these factors and the challenges they raise, and provides a summary checklist to help you identify those that apply in your own organisation.
The complex data landscape
Within all but the smallest of organisations and enterprises, there will typically be numerous enterprise software tools, specialist decision support tools, and databases or spreadsheets created by end users. There could also be a legacy of paper records and documents to consider. When cloud data stores and quickly established web-based software services are added to the equation, the data landscape becomes even more complex, and is doing so at a rapid rate. The physical locations of an organisation's data stores are no longer solely premises owned by that organisation.
Each of these data stores is likely to have a complex data structure to suit the requirements of the software. Developing the data models for these data stores will be a large task for an experienced data modeller. Taking a ‘step up’, enterprise architects should have an overview of the conceptual and logical data models for each of the corporate data stores. They should also understand the different areas where the same or similar data are stored.
This leads into the challenge of master data management (MDM): for all the entities that exist in more than one data store, the organisation needs to be aware not only of all these representations, but also of the ‘master’ data source that is the ‘single source of truth’. Good examples of entities that are likely to appear in multiple data stores include: customers; products; employees; assets; and materials.
MDM is primarily a business approach to ensure that, as data updates are required, they are first applied to the master data source and then replicated to all the dependent data sources. This process can be supported by specific MDM software tools. It needs to be stressed, however, that these can be expensive to install, complex to implement and difficult to maintain, so they will not be relevant to every situation. It is therefore prudent also to consider reducing (and eventually minimising) the number of different data representations of the same entity within the organisation, in order to reduce the amount of ‘work’ required by the chosen MDM approach.
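The master-first update flow described above can be sketched in outline. This is a minimal illustration of the principle, not a real MDM tool: the customer records, store names and fields below are all hypothetical.

```python
# Minimal sketch of the MDM update flow: changes are applied to the
# master data source first, then replicated to each dependent store.
# All store names, IDs and fields are hypothetical.

master = {"C001": {"name": "Acme Ltd", "phone": "0123 456789"}}
dependents = {
    "billing":   {"C001": {"name": "Acme Ltd", "phone": "0123 456789"}},
    "marketing": {"C001": {"name": "Acme Ltd", "phone": "0123 456789"}},
}

def update_customer(customer_id, changes):
    """Apply changes to the master record, then replicate to dependents."""
    master[customer_id].update(changes)           # master first
    for store in dependents.values():             # then each dependent copy
        store[customer_id].update(master[customer_id])

update_customer("C001", {"phone": "0987 654321"})
```

Note that the fewer dependent representations there are, the less replication ‘work’ a routine like this has to do, which is the point made above about minimising duplicate representations of the same entity.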
Complex decisions
Decision making in most organisational contexts can range from very simple through to extremely complex. Simple decisions, such as where to send a technician, will possibly not require much input data and will only have limited consequences if the wrong decision is taken. In contrast, more complex decision making, such as strategic planning, life cycle costing and project planning, is likely to have more extensive decision logic, more significant consequences and a greater reliance on the quality of input data.
To look at a slightly different context: in a biology or physics experiment, it is usually understood that some factors cannot be fully controlled and that the accuracy of measurements is not perfect. Results are therefore often expressed as a range, for example 254 +/- 10. In many complex business decisions, the quality of input data will be as variable as that encountered in such an experiment, yet the outputs typically include no expression of this sensitivity. This can easily lead to incorrect assumptions about the certainty of the decision.
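One way to address this is to carry the input uncertainty through to the output, so that the result of a calculation is reported as a range rather than a single figure. The sketch below assumes a deliberately simple life-cycle cost model; the model and figures are illustrative only.

```python
# Sketch of carrying an input-data uncertainty through to the output,
# so the result is reported as a range rather than a single figure.
# The cost model and the numbers are illustrative assumptions.

def lifecycle_cost(annual_cost, years):
    return annual_cost * years

annual_cost = 254    # measured value
uncertainty = 10     # +/- on the measurement, as in the example above
years = 5

low = lifecycle_cost(annual_cost - uncertainty, years)
high = lifecycle_cost(annual_cost + uncertainty, years)

print(f"Estimated cost: {lifecycle_cost(annual_cost, years)} "
      f"(range {low} to {high})")
```

Presenting the range alongside the central estimate makes the sensitivity of the decision to input data quality visible, rather than leaving it implicit.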
In summary, you must understand the decisions that your data support so that you can determine the extent to which data quality will influence the reliability of those decisions.
Virtuous circle or downward spiral?
In general, the decision-making process will be influenced by data quality. What you should be trying to avoid is a downward spiral where poor data quality leads to poorer information quality. In turn, this will tend to lead to incorrect business decisions and hence worse results. A poorly thought out project, decision or activity is likely to lead to worse data being generated as a result of staff being demoralised