Large-scale databases and data warehouses often suffer from data quality problems, caused by the data-collecting mechanism (e.g., inaccurate sensor readings), by incorporating inconsistent data sources over time, or by a lack of understanding of the semantics of the data. Before ``cleaning'' the data, it is important to understand the extent of these problems. In the simplest case, we can construct a set of data quality rules that determine whether a data record is ``clean'' or ``dirty''. However, these rules may be complex and domain-specific. Furthermore, we may not be able to judge data quality by examining individual records; each record may be correct on its own, but inconsistent with other records. Thus, measuring the quality of a data set or a database is a challenging task.
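To make this concrete, the sketch below illustrates the two kinds of checks mentioned above: record-level rules that flag a single record as dirty, and a cross-record check that flags records which are individually plausible but mutually inconsistent. The field names (customer_id, age, email, birth_date) and the rules themselves are hypothetical placeholders; real rule sets are larger and domain-specific.

```python
import re
from datetime import date

def record_level_violations(record):
    """Record-level rules: each one inspects a single record in isolation."""
    violations = []
    # Hypothetical rule: age must be a plausible, non-negative number.
    if not (0 <= record.get("age", -1) <= 120):
        violations.append("age_out_of_range")
    # Hypothetical rule: email must match a simple syntactic pattern.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")):
        violations.append("malformed_email")
    return violations

def cross_record_violations(records):
    """Cross-record rule: records that pass every record-level rule can still
    conflict with each other (here, one customer_id mapped to two birth dates)."""
    first_seen, conflicts = {}, set()
    for r in records:
        key = r["customer_id"]
        if key in first_seen and first_seen[key] != r.get("birth_date"):
            conflicts.add(key)
        first_seen.setdefault(key, r.get("birth_date"))
    return conflicts

records = [
    {"customer_id": 1, "age": 34, "email": "a@example.com", "birth_date": date(1990, 1, 1)},
    {"customer_id": 1, "age": 34, "email": "a@example.com", "birth_date": date(1991, 1, 1)},
    {"customer_id": 2, "age": 150, "email": "not-an-email", "birth_date": date(1980, 5, 5)},
]

for r in records:
    print(r["customer_id"], record_level_violations(r))
print("ids with conflicting records:", cross_record_violations(records))
```

The two records for customer 1 each pass the record-level rules, yet together they are inconsistent, which is exactly why record-by-record inspection alone cannot measure the quality of a data set.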
Data quality metrics have been the focus of research in the database
and statistics communities, resulting in complementary approaches.
In general, database metrics are motivated by constraint satisfaction,
while statistical metrics quantify departure from distributional and model
assumptions; a small sketch contrasting the two appears after the list
below. In this workshop, we explore both types of approaches, with
an emphasis on:
- recent advances in research,
- the role of data quality metrics in data cleaning,
- applications and case studies, and
- tools.
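As a rough illustration of the contrast drawn above, the sketch below scores the same made-up column twice: once by constraint satisfaction (a database-style metric) and once by departure from a simple distributional assumption (a statistics-style metric). The value range, the threshold, and the use of a robust z-score are illustrative assumptions, not prescriptions from any particular system.

```python
import statistics

# Hypothetical column of sensor readings; the 9999.0 entries stand in for
# error codes that hurt quality under both views.
values = [21.0, 22.5, 20.8, 23.1, 9999.0, 21.7, 22.0, 9999.0, 20.5, 21.9]

def constraint_satisfaction(vals, lo=-50.0, hi=60.0):
    """Database-style metric: fraction of values satisfying a domain constraint."""
    satisfied = sum(1 for v in vals if lo <= v <= hi)
    return satisfied / len(vals)

def distributional_departure(vals, threshold=3.5):
    """Statistics-style metric: fraction of values departing strongly from a
    rough normality assumption, using robust z-scores (median and MAD)."""
    med = statistics.median(vals)
    mad = statistics.median(abs(v - med) for v in vals) or 1e-9
    outliers = sum(1 for v in vals if abs(v - med) / (1.4826 * mad) > threshold)
    return outliers / len(vals)

print("constraint satisfaction:", constraint_satisfaction(values))    # 0.8
print("distributional departure:", distributional_departure(values))  # 0.2
```

Here the two metrics happen to agree on which values are problematic, but in general they need not: a value can satisfy every declared constraint while still being a strong statistical outlier, and vice versa, which is what makes the approaches complementary.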
Relevant topics include the following.