The word "data" has taken on a broad meaning in the last five years. It is no longer a set of numbers or even text. New data paradigms include data streams characterized by a high rate of accumulation, web scraped documents and tables, web server logs, images, audio and video, to name a few. Well-known challenges of heterogeneity and scale continue to grow as data are integrated from disparate sources and become more complex in size and content.
While new paradigms have enriched data, the quality of data has declined considerably. In earlier times, data were collected as part of pre-designed experiments, where collection could be monitored to enforce data quality standards. The data sets themselves were small enough that, even when collection was unsupervised, the data could be quickly scrubbed using highly manual methods. Today, neither monitoring of data collection nor manual scrubbing is feasible, given the sheer size and complexity of the data.
An additional challenge in addressing data quality is the domain dependence of both problems and solutions. Metadata and domain expertise have to be discovered and incorporated into the solutions, entailing extensive interaction with widely scattered experts. This aspect of data quality makes general, one-size-fits-all solutions difficult to find. However, the process of discovering metadata and domain expertise can be automated through the development of appropriate tools and techniques, such as data browsing and exploration, knowledge representation, and rule-based programming, as sketched below.
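As a concrete illustration, domain expertise elicited from experts can often be captured as declarative, reusable quality rules. The minimal Python sketch below shows one possible encoding; the record fields, thresholds, and the Rule structure are hypothetical examples, not a prescribed design.

```python
# A minimal sketch of rule-based data quality checking.
# The field names ("age", "email") and thresholds are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True if the record passes

rules = [
    Rule("age_in_range", lambda r: 0 <= r.get("age", -1) <= 120),
    Rule("email_has_at", lambda r: "@" in r.get("email", "")),
]

def audit(records: list[dict]) -> dict[str, int]:
    """Count rule violations across a batch of records."""
    violations = {rule.name: 0 for rule in rules}
    for record in records:
        for rule in rules:
            if not rule.check(record):
                violations[rule.name] += 1
    return violations

print(audit([{"age": 34, "email": "a@b.com"},
             {"age": 250, "email": "bad"}]))
# -> {'age_in_range': 1, 'email_has_at': 1}
```

Because the rules are data rather than code, domain experts can review, extend, and version them independently of the auditing machinery.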
Many disciplines have taken piecemeal approaches to data quality. The areas of process management, statistics, data mining, database research, and metadata coding have each developed their own ad hoc approaches to different pieces of the data quality puzzle. These include statistical techniques for process monitoring and for the treatment of incomplete data and outliers; techniques for monitoring and auditing data delivery processes; database research on integration and on the discovery of functional dependencies and join paths; and languages for data exchange and metadata representation.
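To make one of these pieces concrete, functional dependency discovery asks whether the values in one column uniquely determine the values in another. The toy Python sketch below verifies a single candidate dependency on a small relation; the column names and sample rows are hypothetical, and practical discovery systems search the space of candidate dependencies far more efficiently than this exhaustive scan.

```python
# Toy check of a candidate functional dependency lhs -> rhs:
# the dependency holds if no lhs value maps to two distinct rhs values.
def fd_holds(rows: list[dict], lhs: str, rhs: str) -> bool:
    seen: dict = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        if key in seen and seen[key] != value:
            return False  # counterexample: same lhs, different rhs
        seen[key] = value
    return True

# Hypothetical sample relation.
rows = [
    {"zip": "07974", "city": "Murray Hill"},
    {"zip": "07974", "city": "Murray Hill"},
    {"zip": "10027", "city": "New York"},
    {"zip": "10025", "city": "New York"},
]
print(fd_holds(rows, "zip", "city"))  # True: each zip maps to one city
print(fd_holds(rows, "city", "zip"))  # False: New York spans two zips
```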
We need an integrated, end-to-end approach within a common framework, in which the various disciplines can complement and leverage each other's strengths. In this workshop, our broad objective is to bring together experts from different research disciplines to initiate a comprehensive technical discussion on data quality, data cleaning, and the treatment of noisy data.
The format of the workshop will be a combination of invited talks, contributed papers, and posters. Invited and contributed papers will be published in the workshop proceedings.