DIMACS/CCICADA Workshop on Big Data Integration

June 20 - 21, 2013
DIMACS Center, CoRE Building, Rutgers University

Xin Luna Dong, AT&T Labs-Research, lunadong at research.att.com
Zachary Ives, University of Pennsylvania, zives at cis.upenn.edu
Presented under the auspices of the DIMACS Special Focus on Information Sharing and Dynamic Data Analysis and The Command, Control, and Interoperability Center for Advanced Data Analysis (CCICADA).


Juliana Freire, NYU-Poly

Title: Big Data Analysis and Integration

The explosion in the volume of publicly available structured data has created new opportunities in many different domains, from spurring scientific discoveries to enabling smarter cities that provide a better quality of life for their citizens. Unfortunately, the infrastructure to analyze, visualize, and integrate information has not kept pace with our collective ability to gather data, leading to an unprecedented situation: data integration and exploration are now bottlenecks to discovery. In this talk, I will present a family of techniques that we have designed to integrate data at a large scale. I will then discuss challenges involved in the interactive exploration of big structured data sets and describe preliminary work we have done to support visual analysis of urban data.


Juliana Freire is a Professor at the Department of Computer Science and Engineering at the Polytechnic Institute of New York University. She also holds an appointment in the Courant Institute of Mathematical Sciences. Previously, she was an Associate Professor at the School of Computing, University of Utah; an Assistant Professor at OGI/OHSU; and a member of technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies). Her recent research has focused on Web-scale data integration, big-data analysis and visualization, provenance, and scientific data management. Professor Freire is an active member of the database and Web research communities. She has co-authored over 130 technical papers, holds 8 U.S. patents, has chaired or co-chaired several workshops and conferences, and has participated as a program committee member in over 60 events. She is a recipient of an NSF CAREER award and an IBM Faculty Award. Her research has been funded by grants from the National Science Foundation, Department of Energy, National Institutes of Health, the University of Utah, Sloan Foundation, Microsoft Research, Yahoo!, Amazon, and IBM.

Tim Kraska, Brown University

Title: CrowdER: Crowdsourcing Entity Resolution

Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers, but even with batching, a human-only approach is infeasible for data sets of even moderate size because of the large number of candidate matches to be tested. Instead, we propose a hybrid human-machine approach, CrowdER, in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-hard, but we develop a novel two-tiered heuristic approach for creating batched tasks. Furthermore, in recent work we extended CrowdER to exploit transitive relations between entities to further reduce the cost and time of entity resolution. Experimental results show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.
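
The machine pass described in the abstract above can be illustrated with a minimal sketch. It is not the CrowdER implementation: the token-based Jaccard similarity, the 0.5 threshold, the function names, and the sample records are all assumptions chosen for illustration. The sketch only shows how a coarse similarity filter shrinks the set of pairs that would be sent to human workers.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, threshold=0.5):
    """Machine pass (hypothetical helper): keep pairs at or above the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = jaccard(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    # Most likely matches first, so they are verified by people first.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


# Made-up product records, for illustration only.
records = [
    "iPad 2 16GB WiFi White",
    "Apple iPad 2 16GB White",
    "iPhone 4 16GB White",
    "Samsung Galaxy Tab 10.1",
]

all_pairs = len(records) * (len(records) - 1) // 2
to_verify = candidate_pairs(records, threshold=0.5)
print(f"{len(to_verify)} of {all_pairs} pairs would go to human workers")
for i, j, score in to_verify:
    print(f"  {records[i]!r} vs {records[j]!r} (similarity {score:.2f})")
```

In this sketch the threshold trades cost against recall: a lower value sends more pairs to people and misses fewer true matches, while a higher value does the opposite.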


Tim Kraska is an Assistant Professor in the Computer Science department at Brown University. Currently, his research focuses on large-scale analytics, multi-data-center consistency, and hybrid human/machine database systems. Before joining Brown, Tim Kraska spent two years as a postdoc in the AMPLab at UC Berkeley after receiving his PhD from ETH Zurich, where he worked on transaction management and stream processing in the cloud. He was awarded a Swiss National Science Foundation Prospective Researcher Fellowship (2010), a DAAD Scholarship (2006), a University of Sydney Master of Information Technology Scholarship for outstanding achievement (2005), the University of Sydney Siemens Prize (2005), a VLDB best demo award (2011), and an ICDE best paper award (2013).

Renee J. Miller, Univ. of Toronto

Title: Big Data Curation

A new mode of inquiry, problem solving, and decision making has become pervasive in our society, consisting of applying computational, mathematical, and statistical models to infer actionable information from large quantities of data. This paradigm, often called Big Data Analytics or simply Big Data, requires new forms of data management to deal with the volume, variety, and velocity of Big Data. Many of these data management problems can be described as data curation. Data curation includes all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data and to generate new sources of information and knowledge. In this talk, I describe our experience in curating several open data sets. I give an overview of how we have adapted some of the traditional solutions for aligning data and creating semantics to account for (and take advantage of) Big Data.


Renee J. Miller received BS degrees in Mathematics and in Cognitive Science from the Massachusetts Institute of Technology. She received her MS and PhD degrees in Computer Science from the University of Wisconsin in Madison, WI. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers, and the National Science Foundation Career Award. She is a Fellow of the ACM, the President of the VLDB Endowment, and was the Program Chair for ACM SIGMOD 2011 in Athens, Greece. She and her co-authors received the 2013 ICDT Test-of-Time Award for their influential 2003 paper establishing the foundations of data exchange. Her research interests are in the efficient, effective use of large volumes of complex, heterogeneous data. This interest spans data integration, data exchange, and data curation. She is a Professor and the Bell Canada Chair of Information Systems at the University of Toronto. In 2011, she was elected to the Fellowship of the Royal Society of Canada (FRSC), Canada's national academy.

Tom M. Mitchell, CMU

Title: Learning to Read the Web

One of the great technical challenges in big data is to construct computer systems that learn continuously over years, from a continuing stream of diverse data, improving their competence at a variety of tasks, and becoming better learners over time.

We describe our research to build a Never-Ending Language Learner (NELL) that runs 24 hours per day, forever, learning to read the web. Each day NELL extracts (reads) more facts from the web, and integrates these into its growing knowledge base of beliefs. Each day NELL also learns to read better than yesterday, enabling it to go back to the text it read yesterday, and extract more facts, more accurately, today. NELL has been running 24 hours/day for over three years now. The result so far is a collection of 50 million interconnected beliefs (e.g., servedWith(coffee, applePie), isA(applePie, bakedGood)) that NELL is considering at different levels of confidence, along with hundreds of thousands of learned phrasings, morphological features, and web page structures that NELL has learned to use to extract beliefs from the web. Track NELL's progress at http://rtw.ml.cmu.edu.


Tom M. Mitchell founded and chairs the Machine Learning Department at Carnegie Mellon University, where he is the E. Fredkin University Professor. His research uses machine learning to develop computers that are learning to read the web, and uses brain imaging to study how the human brain understands what it reads. Mitchell is a member of the U.S. National Academy of Engineering, a Fellow of the American Association for the Advancement of Science (AAAS), and a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI). He believes the field of machine learning will be the fastest growing branch of computer science during the 21st century. Mitchell's web page is http://www.cs.cmu.edu/~tom.

Divesh Srivastava, AT&T Labs-Research

Title: In Search of Truth (on the Deep Web)

The Deep Web has made a huge amount of useful information available, and people have come to rely on it to fulfill their information needs in a variety of domains. We present a study on the accuracy of data and the quality of Deep Web sources in two domains where quality is important to people's lives: Stock and Flight. We observe that, even in these domains, the quality of the data is less than ideal, with sources providing conflicting, out-of-date, and incomplete data. Sources also copy, reformat, and modify data from other sources, making it difficult to discover the truth. We describe techniques proposed in the literature to solve these problems, evaluate their strengths on our data, and identify directions for future work in this area.


Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his Ph.D. from the University of Wisconsin, Madison, and his B.Tech from the Indian Institute of Technology, Bombay. He is a fellow of the ACM, and his research interests span a variety of topics in data management.

Document last modified on June 6, 2013.