CCR/DIMACS Tutorial on Mining Massive Data Sets and Streams: Mathematical Methods and Algorithms for Homeland Defense

June 17 - 19, 2002
Center for Communications Research (CCR), Princeton, NJ

Organizers:
Bob Grossman, chair, University of Illinois and Two Cultures Group, grossman@uic.edu
Paul Kantor, Rutgers University, kantorp@cs.rutgers.edu
Muthu Muthukrishnan, AT&T Labs - Research and Rutgers University, muthu@cs.rutgers.edu

Workshop Coordinator:
Dolores Koch, DJK@idaccr.org
Co-sponsored by the Center for Communications Research (CCR) and DIMACS.

Abstracts:



Stephen G. Eick, Visual Insights

Title: Mining Clickstream Data

The growth of the Internet and the dot-com revolution have established the e-channel as a critical component of how corporations interact with their customers. One of the unique aspects of this channel is its rich instrumentation: it is literally possible to capture every visit, click, page view, purchasing decision, and other fine-grained details describing visitor browsing patterns. The problem is that the huge volume of relevant data overwhelms conventional analysis tools. To overcome this problem, we have developed a sequence of analysis tools for understanding website structure, paths and flow through the site, and website activity, together with a clickstream analysis tool called eBizinsights. The most innovative aspect of our tools is a rich visual interactive presentation layer.


Gary William Flake, NEC Research Institute

Title: Web Mining for Hyperlinked Communities

No doubt about it, the web is big. So big, in fact, that many classical algorithms for databases and graphs cannot scale to the distributed multi-terabyte anarchy that is the web. How, then, do we best use, mine, and model this rich collection of data? Arguably, the best approach to developing scalable non-trivial applications and algorithms is to exploit the fact that, both as a graph and as a database, the web is highly non-random. Furthermore, since the web is mostly created and organized by humans, the graph structure (in the form of hyperlinks) encodes aspects of the content, and vice-versa.

This tutorial will survey the systematic regularities found within the link structure of the web and show how many new and popular algorithms (e.g., HITS, PageRank, and the Community Algorithm) exploit these regularities. We will pay particular attention to algorithms that identify communities or clusters of related web pages. We will further consider how many classical algorithms, which were formulated to minimize worst-case performance over all possible problem instances, can be adapted to the more regular structure of the web.
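As a concrete illustration of how such link-based algorithms work, here is a minimal power-iteration sketch of PageRank in Python; the toy graph, damping factor, and iteration count are illustrative, not drawn from the tutorial itself.

```python
# Minimal PageRank sketch (power iteration) on a toy web graph.
# graph: dict mapping each page to the list of pages it links to.

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:  # dangling page: spread its rank uniformly
                for target in graph:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))  # page "c", with the most in-links, ranks highest
```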


Robert Grossman, University of Illinois at Chicago and Two Cultures Group

Title: An Introduction to High Performance Data Mining for Homeland Defense

A fundamental problem in data mining is to develop data mining algorithms and systems that scale as the amount of data grows, as the dimension grows, and as the complexity of the data grows. There are now parallel versions of some of the standard data mining algorithms, including tree-based classifiers, clustering algorithms, graph algorithms, and association rules. We will cover these algorithms in detail, as well as provide an overview of some of the general techniques and approaches that have been used in developing parallel and high-performance versions of these and other data mining algorithms.
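To make the data-parallel pattern common to many of these algorithms concrete, here is a minimal Python sketch of the partition-and-merge approach, using item counting for association rules as the local statistic; the dataset and number of workers are illustrative, not taken from the tutorial.

```python
# Partition-and-merge sketch: split the records across workers, compute
# local sufficient statistics (item counts), then merge the results.

from collections import Counter
from multiprocessing import Pool

def local_counts(transactions):
    """Count item occurrences in one partition of the data."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))
    return counts

if __name__ == "__main__":
    data = [["milk", "bread"], ["milk", "eggs"], ["bread", "eggs"],
            ["milk", "bread", "eggs"]] * 1000
    partitions = [data[i::4] for i in range(4)]   # split across 4 workers
    with Pool(4) as pool:
        merged = sum(pool.map(local_counts, partitions), Counter())
    print(merged.most_common(3))
```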


Paul Kantor, Rutgers University

Title: Data Fusion in Message Filtering

When messages are to be assigned to classes, that assignment is based upon features of the message $y$, which are assumed stochastically dependent on the classification. Let $f(y|c)$ be the distribution of the feature vector $y$ when the message belongs to class $c$. When there are two different features, $y$ and $y'$, study of either feature alone can lead to an optimal decision rule. Data fusion is the effort to exploit multiple sets of features, $y$, $y'$, $y''$, and the rules corresponding to each of them separately. In the most general case, it is an effort to estimate the joint distribution $f(y,y',y''|c)$ from some knowledge of the marginals. Historically, the problem has been approached parametrically.

Various approaches can be unified by considering an abstract space in which the axes represent the likelihood ratios of the several classifications, as computed using the several sets of features. With this as a foundation we can explore Boolean, fuzzy, linear, non-linear, and non-parametric approaches to data fusion. Arguments of simplicity, monotonicity, and disorder have all been applied to solve the underconstrained problem of data fusion. The ultimate test, however, lies with the effectiveness of the resulting meta-classifier.
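As a small numeric illustration of working in likelihood-ratio space, the following Python sketch combines per-feature-set likelihood ratios log-linearly, which under a (hypothetical) conditional-independence assumption reduces to the naive product rule; all numbers are made up for illustration.

```python
# Log-linear fusion of likelihood ratios from several feature sets.

import math

def fuse_log_linear(ratios, weights=None):
    """Combine per-feature-set likelihood ratios L_i = f(y_i|c1)/f(y_i|c0)
    log-linearly; equal weights recover the naive independence product."""
    if weights is None:
        weights = [1.0] * len(ratios)
    return math.exp(sum(w * math.log(r) for w, r in zip(weights, ratios)))

# Hypothetical likelihood ratios from three feature sets y, y', y''.
ratios = [3.0, 0.8, 2.5]
prior_odds = 0.1                       # P(c1) / P(c0) before any features
posterior_odds = prior_odds * fuse_log_linear(ratios)
print(posterior_odds)                  # e.g., decide c1 if this exceeds 1
```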

Pointers to the relevant literature on both theory and evaluation will be included. This tutorial presumes a basic knowledge of probability (say, Feller, volume 1) and some mathematical sophistication (early graduate student).


David D. Lewis, Independent Consultant

Title: Classification and Mining of Text Data

Natural language text violates every precept of good data modeling: it is ambiguous, redundant, high-dimensional, and has complex and poorly defined structure and semantics. This tutorial will present methods for organizing and analyzing text data in the face of these problems.

You will learn about techniques for classifying natural language text of all sorts and finding patterns within and among textual items. Examples will be presented from a variety of real-world text classification and mining applications. The emphasis will be on statistical and machine learning techniques, with some discussion of the role of linguistic processing.
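For instance, one of the simplest statistical techniques of the kind surveyed is a bag-of-words naive Bayes classifier; a minimal Python sketch follows, with an illustrative two-class toy corpus.

```python
# Bag-of-words naive Bayes: count word frequencies per class, then score
# a new document by summed (smoothed) log-likelihoods plus the log prior.

import math
from collections import Counter, defaultdict

def train(labeled_docs):
    word_counts = defaultdict(Counter)   # class -> word frequencies
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    vocab = {w for c in word_counts for w in word_counts[c]}
    scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / sum(class_counts.values()))
        for w in text.lower().split():   # add-one smoothed likelihoods
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

docs = [("cheap pills buy now", "spam"), ("meeting agenda attached", "ham"),
        ("buy cheap meds", "spam"), ("lunch meeting tomorrow", "ham")]
wc, cc = train(docs)
print(classify("buy pills now", wc, cc))  # -> "spam"
```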


Andrew Moore, Carnegie Mellon University

Title: Introduction to Data Mining

This tutorial will survey the fundamental ideas from computer science and statistics that combine to give data mining. We will see examples of some of the most popular data mining algorithms in action and we will briefly review the key points that make them work. We'll visit decision trees, association rules, non-linear regression, support vector machines, and Bayesian approaches. We will then pay careful attention to some classic pitfalls of careless data mining, including overfitting, confounding causation with correlation, and multiple testing. We will finish with some examples of the kind of work currently taking place in academic and industrial data mining research, including a survey of approaches used to scale statistical data mining to very large data sets, and a couple of exotic models being used in current Homeland Defense data mining applications (surveillance for bioweapon use, and terrorist network analysis).
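The multiple-testing pitfall, in particular, is easy to demonstrate: the following Python sketch mines hundreds of pure-noise "features" and finds one that appears strongly correlated with a random target purely by chance; all data here is synthetic noise by construction.

```python
# Multiple-testing demo: with enough noise features, some look "significant".

import random

random.seed(0)
n, num_features = 30, 500
target = [random.gauss(0, 1) for _ in range(n)]

def correlation(xs, ys):
    """Pearson correlation of two equal-length samples."""
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Test many noise features against the target; report the best-looking one.
best = max(abs(correlation([random.gauss(0, 1) for _ in range(n)], target))
           for _ in range(num_features))
print(f"best |r| among {num_features} pure-noise features: {best:.2f}")
```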


S. Muthu Muthukrishnan, AT&T Labs - Research and Rutgers University

Title: Streaming Data

Say you go fishing. There are many different types of fish and you catch a bounty. At the end of the day, how can you quickly estimate the number of fish types that occur only a few times in your catch, or identify those that are the most abundant? Suppose now that your colleague goes fishing and also gets a bountiful catch: how do you quickly check whether the fish types you have caught are similar? These problems are interesting when you cannot remember all the fish types in your catch, and do not have the time to sort through them to answer the questions above.

The fishing puzzles above are examples of problems that arise when one mines streaming data logs. The Theoretical Computer Science community has recently produced models, algorithms, and complexity results for reasoning about processing streaming data, be it estimating statistical parameters, clustering, or accurately summarizing the streaming signals. In this tutorial, I will present an overview of this emerging theory. Problems become hard when one throws some of the catch back into the pond!
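As one concrete example of this theory, the "most abundant fish types" question can be answered in a single pass with a small, fixed number of counters; here is a minimal Python sketch of a Misra-Gries-style heavy-hitters summary (the catch data is invented for illustration).

```python
# Misra-Gries heavy hitters: one pass, at most k-1 counters in memory.
# Any item occurring more than a 1/k fraction of the stream is guaranteed
# to survive in the final counter set.

import random

def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:  # decrement every counter; drop those that reach zero
            counters = {i: c - 1 for i, c in counters.items() if c > 1}
    return counters

catch = ["cod"] * 500 + ["tuna"] * 300 + ["eel"] * 10 + ["pike"] * 5
random.shuffle(catch)
print(misra_gries(catch, k=4))   # cod and tuna dominate the counters
```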


Robert Schapire, AT&T

Title: Machine Learning Algorithms for Classification

Machine learning studies the design of computer algorithms that automatically make predictions about the unknown based on past observations. Often, the goal is to learn to categorize objects into one of a relatively small set of classes. This tutorial will introduce some of the main state-of-the-art machine learning techniques for solving such classification problems, namely, decision trees, boosting and support-vector machines. The tutorial will also discuss some of the key issues in classifier design, including avoidance of overfitting.
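To give a flavor of one of these techniques, here is a minimal Python sketch of AdaBoost-style boosting with threshold "stumps" as weak learners on a one-dimensional toy dataset; the data and number of rounds are illustrative, not from the tutorial itself.

```python
# AdaBoost with 1-D threshold stumps: reweight examples each round so the
# next weak learner focuses on the mistakes, then vote with weights alpha.

import math

def best_stump(xs, ys, w):
    """Weak learner: threshold/sign pair with the least weighted error."""
    best = None
    for t in xs:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if sign * (1 if xi > t else -1) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, sign = best_stump(xs, ys, w)
        err = max(err, 1e-10)                  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # reweight: mistakes get heavier, correct examples lighter
        w = [wi * math.exp(-alpha * yi * sign * (1 if xi > t else -1))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(x, ensemble):
    vote = sum(a * s * (1 if x > t else -1) for a, t, s in ensemble)
    return 1 if vote > 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, 1, -1, 1, 1, 1, 1]
model = adaboost(xs, ys)
print([predict(x, model) for x in xs])
```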


Manfred Warmuth, University of California, Santa Cruz

Title: On-line Learning

We consider learning from examples. We start with a parameterized model class and a loss function that assigns each example and model a non-negative loss.

The on-line algorithm sees one example at a time and incurs a loss on the current example based on its current model. This model (hypothesis) is updated on-line as more examples are seen by the learner. The best fixed model is chosen off-line. It is the model in the class with the smallest (total) loss on all examples.

The loss of the on-line algorithm on a sequence of examples is typically larger than the loss of the best off-line model. However, the goal of the on-line learner is to minimize the additional loss of the on-line algorithm over the loss of the best off-line model. Thus the off-line model serves as a comparator. Bounds relating the on-line loss to the best off-line loss are called relative loss bounds. Such bounds quantify the price of hiding the future examples from the learner.
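As a concrete instance of an on-line algorithm with a relative loss bound, here is a minimal Python sketch of a multiplicative-weights ("Hedge"-style) learner over a finite set of models; its total loss exceeds that of the best fixed model by a term growing only logarithmically in the number of models. The loss sequence and learning rate are illustrative.

```python
# Hedge-style multiplicative weights over a finite set of models ("experts").

import math

def hedge(loss_stream, num_experts, eta=0.5):
    """loss_stream yields, per trial, a list of losses in [0, 1],
    one entry per expert (model)."""
    w = [1.0] * num_experts
    total_loss = 0.0
    for losses in loss_stream:
        z = sum(w)
        p = [wi / z for wi in w]          # current mixture over the experts
        total_loss += sum(pi * li for pi, li in zip(p, losses))
        # exponential update: experts with small loss keep more weight
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    return total_loss

# Expert 0 is consistently best; the learner's loss tracks expert 0's.
trials = [[0.0, 1.0, 0.5], [0.1, 0.9, 0.5], [0.0, 1.0, 0.4]] * 20
print(hedge(trials, num_experts=3))
```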

We will review several methods for deriving on-line algorithms and proving relative loss bounds for them.

Finally, we discuss the case when the off-line comparator is allowed to "shift" over time. Now the off-line algorithm is allowed to partition the data stream into a small number of sections and choose the best hypothesis for each section. In contrast, the on-line algorithm does not know the best partition of the data. The algorithm is faced with a dilemma: should it learn from the recent examples, or should it hold on to its old hypothesis?

We show how to resolve this dilemma in various ways and give algorithms with good bounds on the additional loss of the on-line algorithm over the loss of the best "shifting" off-line model.
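One standard style of resolution (shown here as an illustrative "fixed-share" sketch, not the tutorial's specific algorithms) lets every model give back a small fraction of its weight each trial, so the learner can re-adopt a model that becomes best in a later section:

```python
# Fixed-share variant of the Hedge update: after each multiplicative step,
# each expert keeps (1 - alpha) of its weight and the rest is spread
# uniformly, so no expert's weight ever decays to an unrecoverable level.

import math

def fixed_share(loss_stream, num_experts, eta=0.5, alpha=0.05):
    w = [1.0 / num_experts] * num_experts
    for losses in loss_stream:
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
        z = sum(w)
        w = [wi / z for wi in w]                       # renormalize
        w = [(1 - alpha) * wi + alpha / num_experts for wi in w]
    return w

# Expert 0 is best in the first section, expert 1 in the second;
# the shared weight lets the learner switch to expert 1 after the shift.
trials = [[0.0, 1.0]] * 30 + [[1.0, 0.0]] * 30
print(fixed_share(trials, num_experts=2))
```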

Various applications of these methods will be discussed.