DIMACS/CCICADA Workshop on Systems and Analytics of Big Data

March 17 - 18, 2014
DIMACS Center, CoRE Building, Rutgers University

Organizers: Joseph Gonzalez, UC Berkeley; Daniel Hsu, Columbia University; Li Erran Li, Bell Labs, erranlli at gmail.com

Presented under the auspices of the Special Focus on Information Sharing and Dynamic Data Analysis and The Command, Control, and Interoperability Center for Advanced Data Analysis (CCICADA).

Abstracts:

Matvey Arye, Princeton University

Title: Computing on JetStream: Streaming Analytics in the Wide-Area

In this talk, I will present JetStream, a system that allows real-time analysis of large, widely-distributed changing data sets. Traditional approaches to distributed analytics require users to specify in advance which data is to be backhauled to a central location for analysis. This is a poor match for domains where available bandwidth is scarce and it is infeasible to collect all potentially useful data.

JetStream addresses bandwidth limits in two ways, both of which are explicit in the programming model. The system incorporates structured storage in the form of OLAP data cubes, so data can be stored for analysis near where it is generated and collected centrally only when needed to fulfill queries. Using cubes, queries can aggregate data in ways and locations of their choosing. The system also includes adaptive filtering and transformation that adjusts data quality to match available bandwidth in order to support real-time queries. Many bandwidth-saving transformations are possible; we discuss which are appropriate for which data and how they can best be combined.
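As a rough illustration of the two ideas in this paragraph (aggregating cube data near the source, and coarsening it to fit available bandwidth), here is a minimal Python sketch. It is not JetStream's API: the names `rollup` and `degrade_to_budget`, the dictionary-based cube, and the use of a cell count as a stand-in for bandwidth are all assumptions made for this example.

```python
from collections import defaultdict


def rollup(cube, key_fn):
    """Aggregate a cube of counts to a coarser key, summing merged cells.

    `cube` maps a dimension tuple (for example (second, url)) to a count.
    `key_fn` maps each fine-grained key to its coarser key.
    """
    coarse = defaultdict(int)
    for key, count in cube.items():
        coarse[key_fn(key)] += count
    return dict(coarse)


def degrade_to_budget(cube, levels, max_cells):
    """Return the finest rollup of `cube` that fits within `max_cells`.

    `levels` is ordered from finest to coarsest; each entry is a key
    function applied to the original cube. The cell count is a
    hypothetical proxy for the bandwidth needed to ship the cube.
    If no level fits, the coarsest rollup is returned.
    """
    result = cube
    for key_fn in levels:
        if len(result) <= max_cells:
            break
        result = rollup(cube, key_fn)
    return result


# Example: request counts keyed by (second, url).
requests = {(0, "/a"): 3, (1, "/a"): 2, (61, "/a"): 1, (5, "/b"): 4, (70, "/b"): 6}
levels = [
    lambda k: (k[0] // 60, k[1]),  # per minute, per URL
    lambda k: (k[0] // 60,),       # per minute, all URLs
]
per_minute = degrade_to_budget(requests, levels, max_cells=2)
```

Each rollup preserves the total count while reducing the number of cells, which is the sense in which data quality is traded for bandwidth in this sketch.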

We implemented a range of analytic queries on web request logs and image data. Queries could be expressed in a few lines of code. Using structured storage on source nodes resulted in minimal bandwidth consumption compared to copying raw logs. Our adaptive control mechanisms are responsive enough to keep end-to-end latency within a few seconds, even when available bandwidth drops by a factor of two, and flexible enough to express practical policies.

Dhruv Batra, Virginia Tech

Title: CloudCV: Computer Vision as a Cloud Service

A recent World Economic Forum report and a New York Times article declared data to be a new class of economic asset, like currency or gold. Visual content is arguably the fastest growing data on the web. Photo-sharing websites like Flickr and Facebook now host more than 6 billion and 90 billion photos, respectively. Besides consumer data, diverse scientific communities (Civil & Aerospace Engineering, Computational Biology, Bioinformatics, Astrophysics, etc.) are also beginning to generate massive archives of visual content, without necessarily having the expertise or tools to analyze them.

In this talk, I will describe CloudCV, an ambitious system that will provide access to state-of-the-art distributed computer vision algorithms as a cloud service. Our goal is to democratize computer vision; one should not have to be a computer vision, big data and distributed computing expert to have access to state-of-the-art distributed computer vision algorithms. As the first step, CloudCV is focused on object detection and localization in images. CloudCV provides APIs for detecting whether any of 200 object categories, such as entities (person, dog, cat, horse, etc.), indoor objects (chair, table, sofa, etc.), and outdoor objects (car, bicycle, etc.), is present in an image.

Bio: Dhruv Batra is an Assistant Professor at the Bradley Department of Electrical and Computer Engineering at Virginia Tech, where he leads the VT Machine Learning & Perception group. He is a member of the Virginia Center for Autonomous Systems (VaCAS) and the VT Discovery Analytic Center (DAC). Prior to joining VT, he was a Research Assistant Professor at Toyota Technological Institute at Chicago (TTIC), a philanthropically endowed academic computer science institute located on the campus of the University of Chicago. He received his M.S. and Ph.D. degrees from Carnegie Mellon University in 2007 and 2010 respectively, advised by Tsuhan Chen. In the past, he has held visiting positions at the Machine Learning Department at CMU, and at CSAIL MIT. His research interests lie at the intersection of machine learning, computer vision and AI, with a focus on developing scalable algorithms for learning and inference in probabilistic models for holistic scene understanding. He has also worked on other topics such as interactive co-segmentation of large image collections, human body pose estimation, action recognition, depth estimation and distributed optimization for inference and learning in probabilistic graphical models. He was a recipient of the Carnegie Mellon Dean's Fellowship in 2007, the Google Faculty Research Award in 2013, and the Virginia Tech Teacher of the Week in 2013. His research is supported by NSF, Amazon, Google, Microsoft, and Nvidia.

Leon Bottou, Microsoft Research

Title: Counterfactual Reasoning and Learning Systems

Statistical machine learning technologies in the real world are never without a purpose. Using their predictions, humans or machines make decisions whose circuitous consequences often violate the modeling assumptions that justified the system design in the first place. For instance, such contradictions appear very clearly in computational advertisement systems. The design of the ad placement engine directly influences the occurrence of clicks and the corresponding advertiser payments. It also has important indirect effects: (a) ad placement decisions impact the satisfaction of the users and therefore their willingness to frequent this web site in the future, (b) ad placement decisions impact the return on investment observed by the advertisers and therefore their future bids, and (c) ad placement decisions change the nature of the data collected for training the statistical models in the future. Popular theoretical approaches, such as auction theory or multi-armed bandits, only address selected aspects of such a system. In contrast, the language and the methods of causal inference provide a full set of tools to answer the vast collection of questions facing the designer of such a system. Is it useful to pass new input signals to the statistical models? Is it worthwhile to collect and label a new training set? What about changing the loss function or the learning algorithm? In order to answer such questions, one needs to unravel how the information produced by the statistical models traverses the web of causes and consequences and produces measurable losses and rewards. This talk provides a real world example demonstrating the value of causal inference for large-scale machine learning. It also illustrates a collection of practical counterfactual analysis techniques applicable to many real-life machine learning systems, including causal inference techniques applicable to continuously valued variables with meaningful confidence intervals, and quasi-static analysis techniques for estimating how small interventions affect certain causal equilibria. In the context of computational advertisement, this analysis elucidates the connection between auction theory and machine learning.
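The abstract does not say which estimators the talk uses. As one generic example of a counterfactual question answered from logged data with a confidence interval, the sketch below uses a standard importance-weighted (inverse propensity) estimator with a normal-approximation interval. The function name, the log format, and the example policy are assumptions made for illustration.

```python
import math


def counterfactual_estimate(logs, new_policy_prob, z=1.96):
    """Estimate the average reward a new policy would have obtained.

    `logs` is a list of (context, action, logged_prob, reward) tuples,
    where `logged_prob` is the probability with which the deployed
    (randomized) policy chose `action`. `new_policy_prob(context, action)`
    is the probability the hypothetical policy would choose that action.
    Returns (estimate, (lower, upper)), where the interval is a
    normal-approximation confidence interval at the level implied by `z`.
    """
    weighted = [
        reward * new_policy_prob(context, action) / logged_prob
        for context, action, logged_prob, reward in logs
    ]
    n = len(weighted)
    mean = sum(weighted) / n
    variance = sum((w - mean) ** 2 for w in weighted) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, (mean - half_width, mean + half_width)


# Example: the logged policy picked one of two ad slots uniformly at random;
# slot 0 always earned reward 1.0 and slot 1 earned 0.0.
logs = [(None, 0, 0.5, 1.0), (None, 1, 0.5, 0.0)] * 50


def always_slot_zero(context, action):
    # Hypothetical new policy that always chooses slot 0.
    return 1.0 if action == 0 else 0.0


estimate, interval = counterfactual_estimate(logs, always_slot_zero)
```

The estimator is only valid when the logged policy gives nonzero probability to every action the new policy might take; large weights widen the interval, which is one reason confidence intervals matter in this setting.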

Bio: Leon Bottou received the Diplome d'Ingenieur de l'Ecole Polytechnique (X84) in 1987, the Magistere de Mathematiques Fondamentales et Appliquees et d'Informatique from Ecole Normale Superieure in 1988, and a Ph.D. in Computer Science from Universite de Paris-Sud in 1991. Leon joined AT&T Bell Laboratories in 1991 and went on to AT&T Labs Research and NEC Labs America. He joined the Science team of Microsoft adCenter in 2010 and Microsoft Research in 2012. Leon's primary research interest is machine learning. His secondary research interest is data compression and coding. His best known contributions are his work on large scale learning and on the DjVu document compression technology.

Alfio Gliozzo, IBM T.J. Watson Research Center

Title: Beyond Jeopardy! Adapting Watson to New Domains Using Distributional Semantics

Watson is a computer system built to answer rich natural language questions over a broad open domain with confidence, precision, and speed. IBM demonstrated Watson's capabilities in a historic exhibition match on the television quiz show Jeopardy!, where Watson triumphed over the best Jeopardy! players of all time. The new challenge for IBM is to adapt Watson to important business problems and to make this process scalable while requiring minimal effort. In this talk I describe the DeepQA framework implemented by Watson, focusing on the adaptation methodology and presenting new research directions, with emphasis on unsupervised learning technology for distributional semantics linking text to knowledge bases.

Bio: Alfio Gliozzo is a research staff member at the IBM T.J. Watson Research Center. He is currently a technical leader on the DeepQA team, coordinating a research team focused on unsupervised learning from text. At the same time, he is a key contributor to the Watson core technology for domain adaptation. He has been involved in both academic research and industry for 12 years, achieving a significant track record in delivering semantic technologies across different applications, patents and scientific publications.

Joseph Gonzalez, Berkeley

Title: Large Scale Graph-Parallel Computation for Machine Learning: Applications and Systems

From social networks to language modeling, the growing scale and importance of graph data has driven the development of graph computation frameworks such as Google Pregel, Apache Giraph, and GraphLab. These systems exploit specialized APIs and properties of graph computation to achieve orders-of-magnitude performance gains over more general data-parallel systems such as Hadoop MapReduce. In the first half of this talk we review several common data mining and machine learning applications in the context of graph algorithms (e.g. PageRank, community detection, recommender systems, and topic modeling). We then survey the common properties of these algorithms and how specialized graph frameworks exploit these properties in data partitioning and engine execution to achieve substantial performance gains.
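To make the "graph-parallel" pattern concrete, here is a small single-machine sketch of PageRank written in a vertex-centric style: in each superstep every vertex sends a share of its rank along its out-edges, then recomputes its rank from the messages it received. This is an illustration only and does not use the Pregel, Giraph, or GraphLab APIs; the function name, the damping factor, and the handling of dangling vertices are assumptions of the example.

```python
def pagerank(out_edges, damping=0.85, supersteps=50):
    """Vertex-centric PageRank over a dict mapping vertex -> list of targets.

    Rank held by vertices with no out-edges (dangling vertices) is spread
    evenly over all vertices so that the ranks always sum to 1.
    """
    vertices = set(out_edges) | {t for targets in out_edges.values() for t in targets}
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        # "Send" phase: each vertex splits its rank among its neighbors.
        inbox = {v: 0.0 for v in vertices}
        dangling = 0.0
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
            else:
                dangling += rank[v]

        # "Update" phase: each vertex combines its incoming messages.
        rank = {
            v: (1.0 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in vertices
        }
    return rank


# Example: both "a" and "b" link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Because each vertex reads only its own inbox and writes only to its neighbors, the two phases can be partitioned across machines by vertex, which is the property the specialized frameworks exploit.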

In the second half of this talk we revisit the specialized graph-parallel systems through the lens of distributed join optimization in the context of MapReduce systems. We will show that the recent innovations in graph-parallel systems can be cast in the context of data partitioning and indexing, enabling us to efficiently execute graph computation within a MapReduce framework and opening the opportunity to lift tables and graphs to the level of first-class composable views. Finally, we present GraphX, a distributed, fault-tolerant, and interactive system for large-scale graph analytics that is capable of efficiently expressing and executing graph-parallel APIs (e.g., Pregel/Giraph) while at the same time enabling users to switch between table and graph views of the same data without data movement or duplication.

Bio: Joseph Gonzalez is cofounder of GraphLab Inc. and a postdoc in the AMPLab at UC Berkeley. Joseph received his PhD from the Machine Learning Department at Carnegie Mellon University where he worked with Carlos Guestrin on parallel algorithms and abstractions for scalable probabilistic machine learning. Joseph is a recipient of the AT&T Labs Graduate Fellowship and the NSF Graduate Research Fellowship.

Justin Moore, Facebook

Title: Machine Learning Meets the Crowd

If you could ask a billion people one question, what would it be? Facebook has unprecedented scale, both in active users and in the computing resources available to build massively complex machine learning models. The power and effectiveness of each of these areas have been studied in depth, but not much has been done to explore the interactions between the two. This talk explores some of the machine learning methods that Facebook is applying to our structured places database, gives an overview of our places crowdsourcing system, and then goes into depth about some of the interactions between these two systems and shows how the combination can be much greater than the sum of its parts.

Derek Murray, Microsoft Research Silicon Valley Lab

Title: Low-latency distributed analytics in Naiad

Iterative algorithms, incremental updates, and interactive queries require low latency from a data processing system, but the distributed nature of big data analytics makes achieving low latency difficult. To that end, we have developed Naiad, a new distributed system that combines asynchronous and fine-grained synchronous execution to achieve low latency while producing consistent results. On top of the Naiad platform, we have developed frameworks for graph processing, machine learning, and incremental iterative computation. In this talk, I will present the Naiad programming model, discuss some of the frameworks and applications built atop that model, and show how those applications running on Naiad can achieve the performance of specialized, domain-specific systems.

Bio: Derek Murray is a researcher at the Microsoft Research Silicon Valley Lab in Mountain View, CA. His principal research interest is developing systems for large-scale distributed computation, with a focus on extending the class of algorithms that can be run efficiently in this setting. He holds a PhD from the University of Cambridge, where he led the development of the CIEL distributed execution engine.

Christopher (Chris) Re, Stanford

Title: The Thorn in the Side of Big Data: Too Few Artists

A new generation of data processing systems, including web search, Google's Knowledge Graph, IBM's Watson, and several different recommendation systems, combine rich databases with software driven by machine learning. The spectacular successes of these trained systems have been among the most notable in all of computing and have generated excitement in health care, finance, energy, and general business. But building them can be challenging even for computer scientists with PhD-level training. If these systems are to have a truly broad impact, building them must become easier. This talk describes our recent thoughts on one crucial pain point in the construction of trained systems: feature engineering. For street-art lovers, this talk will argue that current systems require artists, like Banksy, but we need them to be usable by the Mr. Brainwashes of the world. As an example, this talk will also describe some recent work on building trained systems to support research in paleobiology and feature selection for enterprise analytics.

Bio: Christopher (Chris) Re is an assistant professor in the Department of Computer Science at Stanford University. The goal of his work is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington in Seattle under the supervision of Dan Suciu. For his PhD work in probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. Chris's papers have received four best-paper or best-of-conference citations, including best paper in PODS 2012, best-of-conference in PODS 2010 twice, and one best-of-conference in ICDE 2009. Chris received an NSF CAREER Award in 2011 and an Alfred P. Sloan fellowship in 2013.

Alex Smola, CMU

Title: Better Living with Randomness

Scalable algorithm design requires two components: the ability to distribute problems over several machines and the ability to execute portable and scaling-friendly components on single machines. In this talk I will show that randomized techniques can greatly accelerate algorithms and simplify the design. In particular, I will discuss Hash Kernels and Cuckoo Linear Algebra for high-dimensional linear function classes that do not require preprocessing, and FastEx and Alias sampling for fast inference in clustering and topic models.
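For readers unfamiliar with hash kernels, the sketch below shows the basic idea of feature hashing in Python: tokens are mapped straight to indices of a fixed-size vector by a hash function, so no vocabulary has to be built in a preprocessing pass. It is a generic illustration rather than the speaker's implementation; the helper names, the use of MD5 as the hash, and the bucket count are assumptions of the example.

```python
import hashlib


def _stable_hash(token, salt):
    """Deterministic 64-bit hash of (salt, token), stable across runs."""
    digest = hashlib.md5(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, num_buckets=1024):
    """Map a bag of tokens to a fixed-length vector without a dictionary.

    One hash picks the bucket and a second, independent hash picks a
    sign (+1 or -1), so that colliding tokens tend to cancel rather
    than systematically inflate an entry.
    """
    vector = [0.0] * num_buckets
    for token in tokens:
        index = _stable_hash(token, "index") % num_buckets
        sign = 1.0 if _stable_hash(token, "sign") % 2 == 0 else -1.0
        vector[index] += sign
    return vector


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Example: two short documents hashed into the same 1024-dimensional space.
doc_a = hash_features("the quick brown fox".split())
doc_b = hash_features("the lazy brown dog".split())
similarity = dot(doc_a, doc_b)
```

The inner product of two hashed vectors approximates the inner product of the original sparse bag-of-words vectors, and the approximation improves as `num_buckets` grows.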

Ameet Talwalkar and Evan Sparks, Berkeley

Title: MLbase: A User-Friendly System for Distributed Machine Learning

Implementing and consuming Machine Learning (ML) techniques at scale are difficult tasks for ML Developers and Domain Experts. MLbase is a platform addressing the issues of both groups. In this talk, we describe the various components of our system, including a low-level distributed machine learning library in Spark, an experimental API for developing machine learning algorithms and feature extractors, and recent work on higher-level functionality to autotune basic ML pipelines.

Bio: Ameet Talwalkar is an NSF post-doctoral fellow in the Computer Science Division at UC Berkeley. His work addresses scalability and ease-of-use issues in the field of machine learning, as well as applications related to large-scale genomic sequencing analysis. He obtained a bachelor's degree from Yale University and a Ph.D. from the Courant Institute at New York University.

Bio: Evan Sparks is a PhD Student in the Computer Science Division at UC Berkeley. His research focuses on the design and implementation of distributed systems for large scale data analysis and machine learning. Prior to Berkeley he spent several years in industry tackling large scale data problems as a Quantitative Financial Analyst at MDT Advisers and as a Product Engineer at Recorded Future. He holds a bachelor's degree from Dartmouth College.

Document last modified on March 17, 2014.