DIMACS TR: 2002-21

Estimating Rarity and Similarity over Data Stream Windows

Authors: Mayur Datar and S. Muthukrishnan


In the windowed data stream model, we observe items coming in over time. At any time $t$, we consider the window of the last $N$ observations $a_{t-(N-1)},a_{t-(N-2)},\ldots,a_t$, each $a_i \in \{1,\ldots,u\}$; we are allowed to ask queries about the data in the window, say, we wish to compute the minimum or the median of the items in the window. A crucial restriction is that we are only allowed $o(N)$ (often polylogarithmic in $N$) storage space, that is, space smaller than the window size, so the items within the window can not be archived. Window data stream model arose out of the need to formally reason about the underlying data analyses problems in applications like inter-networking and transactions processing.

In this paper, we study two basic problems in the windowed data stream model. The first is the estimation of the rarity of items in the window. While previous work has studied simple distributional parameters such as the number of distinct items in the window, no prior work has addressed the general problem of estimating the rarity. Our second problem is one of estimating similarity between two data stream windows using the Jacard's coefficient. Prior work has focused on $L_p$ norms and set similarity measures such as the Jacard's coefficient have been studied before in the windowed data stream model. The problems of estimating rarity and similarity have many applications in mining massive data.

We present novel, simple algorithms for estimating rarity and similarity on windowed data streams, accurate up to factor $1 \pm \epsilon$ using space only logarithmic in the window size. In both cases, our solutions are based on modifications of the powerful min-wise hashing technique. We expect our solutions to find applications in practice.

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2002/2002-21.ps.gz

DIMACS Home Page