One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM).
Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches, are implemented and experimentally compared.
We also report lower entropy estimates for coding regions
extracted from a large collection of non-redundant human genes.
The conventional estimate is 1.92 bits. Our model produces
only slightly better results (1.91 bits) when considering
nucleotides, but achieves 1.84-1.87 bits when the prediction
problem is divided into two stages: i) predict the next amino
acid based on inexact polypeptide matches, and ii) predict the
particular codon. Our results suggest that matches at
the amino acid level play some role, but a small one, in determining
the statistical structure of non-redundant coding sequences.
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1996/96-51.ps.gz