DIMACS Summer School Tutorial on New Frontiers in Data Mining

August 13 - 17, 2001
Rutgers University, Piscataway, NJ

Organizers: Dimitrios Gunopulos, University of California at Riverside, dg@cs.ucr.edu; Nikolaos Koudas, AT&T Labs - Research, koudas@research.att.com

Presented under the auspices of the Special Foci on Data Analysis and Mining and Computational Molecular Biology.

Abstracts:


1.

Data Quality Assurance in Network Databases
Chung-Min Chen and Munir Cochinwala, Telcordia Technologies

Operation Support Systems (OSS), which support
a telecommunication carrier's network
operations, usually maintain large databases that model
physical networks and their components.
Data quality assurance aims to ensure that
the data in these databases are correct, current, and
consistent, which is vital to the
efficiency and effectiveness of operations.
In this talk, we will discuss related issues and
approaches to automating and assuring network
database quality.



2.

MINING VERY LARGE DATA STREAMS

Pedro Domingos
Department of Computer Science and Engineering
University of Washington

In many domains, data now arrives faster than we are able to mine it.
In some cases (e.g., large networks), merely storing all the data produced
would be prohibitively expensive. To avoid wasting this data, we must switch
from the traditional "one-shot" data mining approach to systems that are
able to mine continuous, high-volume, open-ended data streams as they
arrive. In this talk I will describe a general method for transforming batch
data mining algorithms into data-stream ones and its application to
decision tree induction, k-means clustering and the EM algorithm. I will
provide analytical guarantees that these algorithms produce in finite time
results equivalent to mining infinite data (to within epsilon, with
probability one minus delta), and examples of their practical performance.
For example, our decision-tree learner is able to incorporate one billion
examples per day using off-the-shelf hardware. Its extension to non-stationary
data leads to speedups of four orders of magnitude over traditional windowing
methods, for similar predictive accuracy.
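
The abstract does not name the bound behind its (epsilon, delta) guarantee. As an illustration only, assuming a Hoeffding-style bound (a standard concentration inequality for averages of bounded, independent observations), the sketch below computes how many examples suffice for an observed mean of a quantity with range R to be within epsilon of its true mean with probability at least 1 - delta. The function names and the one-sided form of the bound are choices made for this sketch, not taken from the talk.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """One-sided Hoeffding deviation for n samples at confidence 1 - delta.

    Assumes independent observations confined to an interval of width
    value_range.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which hoeffding_epsilon(value_range, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical settings: range 1, delta = 1e-6, epsilon = 0.01.
    n = examples_needed(1.0, 1e-6, 0.01)
    print(n, hoeffding_epsilon(1.0, 1e-6, n))
```

Under this assumption the required sample size grows only logarithmically in 1/delta and does not depend on how much data the stream eventually delivers, which is one way a finite sample can stand in for an unbounded stream.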

Joint work with Geoff Hulten.



3.

Trajectory Sampling for Direct Traffic Observation
Matt Grossglauser, AT&T Labs - Research

The Internet is vast and difficult to model. Estimation of the
state of even a single operator's domain is hampered by scale and
uncertainty. We discuss some implications for traffic measurement,
which is a critical component for the control and
engineering of IP networks. More specifically, traffic engineering,
capacity planning, and troubleshooting can benefit from 
knowledge of the spatial flow of traffic through an operator's domain, 
i.e., the paths followed by packets between any ingress and egress point.

We argue that existing traffic instrumentation techniques
are inadequate, because they require network state estimation to infer
the spatial flow of traffic. We propose a method that allows the 
direct observation of traffic flows through
a domain by observing the trajectories of a subset of all packets
traversing the network.  The main advantages of the method are that 
(i) it does not rely on routing state, (ii) its implementation cost is small, 
and (iii) the measurement overhead is modest and can be controlled precisely.
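
The abstract does not say how the subset of packets is chosen. A minimal sketch, assuming the choice is made by hashing packet content that does not change from hop to hop: every router applies the same deterministic rule, so a packet is either recorded at every router it crosses or at none, and no routing state is consulted. The names is_sampled and observe, the use of SHA-256, and the byte strings standing in for packets are all hypothetical.

```python
import hashlib


def is_sampled(invariant_bytes: bytes, sampling_fraction: float) -> bool:
    """Decide from packet content alone whether the packet is in the sample."""
    digest = hashlib.sha256(invariant_bytes).digest()
    # Compare the first 64 bits of the digest with a threshold that is
    # proportional to the requested sampling fraction.
    value = int.from_bytes(digest[:8], "big")
    threshold = int(sampling_fraction * 2 ** 64)
    return value < threshold


def observe(path, packets, sampling_fraction):
    """Return {packet: [routers that recorded it]} for packets sent along path."""
    trajectories = {}
    for packet in packets:
        for router in path:
            # Each router evaluates the same rule independently,
            # with no state shared between routers.
            if is_sampled(packet, sampling_fraction):
                trajectories.setdefault(packet, []).append(router)
    return trajectories


if __name__ == "__main__":
    packets = [f"pkt-{i}".encode() for i in range(1000)]
    sampled = observe(["A", "B", "C"], packets, 0.1)
    print(len(sampled), "of", len(packets), "packets have a recorded trajectory")
```

In this sketch the measurement overhead is set directly by sampling_fraction, and each recorded packet yields its full path through the domain.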

Joint work with Nick Duffield, AT&T Labs - Research.



4.

Network Aware Clustering: Technique and Applications
Balachander Krishnamurthy
AT&T Labs-Research
http://www.research.att.com/~bala/papers

Being able to identify the groups of clients that are responsible for
a significant portion of a Web site's requests can be helpful to both
the Web site and the clients. It is beneficial to move content closer
to groups of clients that are responsible for large subsets of 
requests to an origin server. Groupings of clients, called clusters,
that are close together topologically and likely to be under common 
administrative control were introduced last year, using 
"network-aware" techniques. Experimental results show that our 
entirely automated approach is able to identify clusters for 99.9% 
of the clients in a wide variety of Web server logs. Sampled results
show that the identified clusters can be validated in over 90% of 
the cases. We are also able to detect unusual access patterns made 
by spiders and (suspected) proxies.
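
The abstract does not spell out what the "network-aware" techniques are. As a hedged sketch, assume each client address from a server log is assigned to the longest matching prefix in a list of routing prefixes, and clients sharing a prefix form one cluster. The function name cluster_clients, the prefix list, and the addresses (taken from the reserved documentation ranges) are hypothetical.

```python
import ipaddress
from collections import defaultdict


def cluster_clients(client_ips, prefixes):
    """Group client IPs under the longest prefix that contains each of them."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    clusters = defaultdict(list)
    for ip_text in client_ips:
        ip = ipaddress.ip_address(ip_text)
        matches = [net for net in networks if ip in net]
        if matches:
            # Longest-prefix match: the most specific covering network wins.
            best = max(matches, key=lambda net: net.prefixlen)
            clusters[str(best)].append(ip_text)
        else:
            # No covering prefix: keep the client in an "unclustered" bucket.
            clusters[None].append(ip_text)
    return dict(clusters)


if __name__ == "__main__":
    prefixes = ["192.0.2.0/24", "192.0.2.128/25", "198.51.100.0/24"]
    clients = ["192.0.2.5", "192.0.2.200", "198.51.100.7", "203.0.113.9"]
    for prefix, members in cluster_clients(clients, prefixes).items():
        print(prefix, members)
```

The linear scan over prefixes keeps the sketch short; a real prefix table with many entries would normally be searched with a trie or similar structure.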

In this talk I will discuss clusters and the range of applications 
they have been used for in the networking research community.

This is joint work (primarily) with Jia Wang.



5.

Prabhakar Raghavan
Verity Inc.

Social Networks from Web Mining to Enterprise Portals

Social Networks have been recognized for some time as key mechanisms
for information sharing and dissemination. We begin by reviewing both 
classical and recent (web-derived) mining and knowledge discovery algorithms,
viewing them in the context of social networks. A recurrent phenomenon
in these settings is the presence of power-law distributions. We postulate 
a stochastic model for these, and present empirical results on text
frequency distributions that suggest new methods for mining text associations.
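
The stochastic model itself is not given in the abstract, and the sketch below is not that model. It only illustrates a common rough diagnostic for power-law behaviour in text: if frequency is proportional to rank raised to a negative exponent, then log(frequency) plotted against log(rank) is a straight line whose slope is that exponent. The function names are hypothetical, and a least-squares fit on log-log data is a quick check rather than a rigorous estimator.

```python
import math
from collections import Counter


def word_counts(text):
    """Occurrence count of each distinct whitespace-separated word."""
    return list(Counter(text.lower().split()).values())


def rank_frequency_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive counts; rank 1 is the most frequent item.
    """
    freqs = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic counts that follow frequency = 1000 / rank exactly.
    synthetic = [1000 / rank for rank in range(1, 51)]
    print(rank_frequency_slope(synthetic))
```

On real text, word_counts(document) would supply the counts, and a slope that stays roughly constant across ranks is what a power-law distribution predicts.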



6.

Rahul Singh, Exelixis

An Overview of Computational Knowledge Discovery and 
Pattern Analysis Problems in Contemporary Drug Discovery 
and Design

Exploring the relationship between the structure of a 
molecule and its bio-chemical properties constitutes the 
basis of drug discovery. State-of-the-art approaches to
this investigation involve a combination of techniques 
that include physical enumeration and tests based on 
combinatorial chemistry and high-throughput screening 
as well as rational pharmaceutical design based on geometric 
and chemical characteristics of molecule-molecule interaction. 
Furthermore, understanding and optimizing factors like the 
effect of the compound on the body and the effect of the
body on the compound are essential in developing a drug. 
Given the exploratory nature of drug discovery, the data volume, 
and the multiple data modalities, it is therefore not
surprising that the area is rich in algorithmic problems 
related to knowledge discovery, pattern analysis, and efficient 
computability. In this talk, I will attempt to provide an
overview of the drug discovery process and present salient 
problems that are related to the aforementioned computational 
domains.

Document last modified on July 24, 2001.
