DIMACS/CCICADA Workshop on Data Quality Metrics

February 3 - 4, 2011
DIMACS Center, CoRE Building, Rutgers University

Tamraparni Dasu, AT&T Research, tamr at research.att.com
Lukasz Golab, AT&T Research, lgolab at research.att.com
Presented under the auspices of The Homeland Security Center for Command, Control, and Interoperability Center for Advanced Data Analysis (CCICADA).

Large-scale databases and data warehouses often suffer from data quality problems, caused by the data-collecting mechanism (e.g., inaccurate sensor readings), by incorporating inconsistent data sources over time, or by a lack of understanding of the semantics of the data. Before "cleaning" the data, it is important to understand the extent of these problems. In the simplest case, we can construct a set of data quality rules that determine whether a data record is "clean" or "dirty". However, these rules may be complex and domain-specific. Furthermore, we may not be able to judge data quality by examining individual records; each record may be correct on its own, but inconsistent with other records. Thus, measuring the quality of a data set or a database is a challenging task.
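
As a rough illustration of the two situations described above, the following Python sketch contrasts a record-level rule with a cross-record consistency check. Everything in it is an assumption made for illustration: the records, the field names, the plausible temperature range, and the choice of a functional dependency (zip determines city) as the cross-record constraint. It is not a method put forward by the workshop.

```python
from collections import defaultdict

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"sensor": "s1", "zip": "08854", "city": "Piscataway", "temp_c": 21.5},
    {"sensor": "s2", "zip": "08854", "city": "Piscataway", "temp_c": -999.0},
    {"sensor": "s3", "zip": "08854", "city": "New Brunswick", "temp_c": 20.1},
    {"sensor": "s4", "zip": "07932", "city": "Florham Park", "temp_c": 19.8},
]


def is_clean(record):
    """Record-level rule (assumed): the temperature must be physically plausible."""
    return -90.0 <= record["temp_c"] <= 60.0


def fraction_clean(rows):
    """Share of records that pass the record-level rule."""
    return sum(is_clean(r) for r in rows) / len(rows)


def fd_violations(rows, lhs, rhs):
    """Return each lhs value that maps to more than one rhs value,
    i.e. the groups that violate the functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[lhs]].add(r[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


print(fraction_clean(records))
print(fd_violations(records, "zip", "city"))
```

In this sketch, record s2 fails the record-level rule, while record s3 passes it yet still conflicts with s1 and s2 on the city associated with the same zip code. That conflict becomes visible only when records are compared with one another.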

Data quality metrics have been the focus of research in the database and statistics communities, resulting in complementary approaches. In general, database metrics are motivated by constraint satisfaction while statistical metrics quantify departure from distributional and model assumptions. In this workshop, we explore both types of approaches, with an emphasis on:
- recent advances in research,
- role of data quality metrics in data cleaning,
- applications & case studies, and
- tools.

Relevant topics include the following.

Document last modified on October 5, 2010.