DIMACS TR: 95-04

DNA Sequence Classification Using Compression-Based Induction



Authors: David Loewenstern, Haym Hirsh, Michiel Noordewier, Peter Yianilos

ABSTRACT

Inductive learning methods, such as neural networks and decision trees, have become a popular approach to developing DNA sequence identification tools. Such methods attempt to form models of a collection of training data that can be used to predict future data accurately. The common approach to using such methods on DNA sequence identification problems forms models that depend on the absolute locations of nucleotides and assume independence of consecutive nucleotide locations. This paper describes a new class of learning methods, called compression-based induction (CBI), that is geared towards sequence learning problems such as those that arise when learning DNA sequences. The central idea is to use text compression techniques on DNA sequences as the means for generalizing from sample sequences. The resulting methods form models that are based on the more important relative locations of nucleotides and on the dependence of consecutive locations. They also provide a suitable framework for injecting biological domain knowledge into the learning process. We present initial explorations of a range of CBI methods that demonstrate the potential of our methods for DNA sequence identification tasks.
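
As a rough illustration of the central idea (not the authors' implementation, since the abstract does not specify which compressor or model they use), the Python sketch below assumes a simple order-k context model per class and assigns a sequence to the class whose model would encode it in the fewest bits. The names train_model, code_length, and classify, the parameter k, and the add-one smoothing are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed nucleotide alphabet for this sketch

def train_model(sequences, k=2):
    """Count how often each nucleotide follows each length-k context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts

def code_length(seq, counts, k=2):
    """Approximate number of bits needed to encode seq under a model.

    Uses add-one smoothing (an assumption of this sketch) so that
    unseen context/nucleotide pairs still get a nonzero probability.
    The first k symbols are not charged, since they have no full context.
    """
    bits = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})
        total = sum(ctx.values()) + len(ALPHABET)
        p = (ctx.get(seq[i], 0) + 1) / total
        bits -= math.log2(p)
    return bits

def classify(seq, models, k=2):
    """Return the label whose model gives the shortest code length."""
    return min(models, key=lambda label: code_length(seq, models[label], k))

# Toy usage with made-up training data (not from the paper).
models = {
    "at_rich": train_model(["ATATATATATAT", "TATATATATA"]),
    "gc_rich": train_model(["GCGCGCGCGCGC", "CGCGCGCGCG"]),
}
label = classify("ATATATAT", models)
```

Because each prediction is conditioned on the preceding k nucleotides rather than on a fixed position in the sequence, a model of this kind depends on relative locations and on the dependence between consecutive nucleotides, which is the contrast the abstract draws with position-based methods.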

Paper available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1995/95-04.ps.gz