Mathematical Aspects of Sequencing by Nested Strand Hybridization

Mikhail S. Gelfand, Anatoly R. Rubinov, Oleg I. Razgulyaev, and Alexander B. Chetverin

Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow region, 142292, Russia

Standard SBH identifies all k-nucleotides that are contained in a DNA sample (k-mer vocabulary). In addition to this information sequencing by nested strand hybridization (SNSH) [A.B.Chetverin and F.R.Kramer] provides data about k-mer precedence. This allows to eliminate difficulties caused by repeating (k - 1)-mers and to reconstruct simultaneously several DNA fragments.

More exactly, consider a set of DNA fragments. From the experiment it is known what pairs of k-mers occur simultaneously in at least one fragment, and in what order. The problem is to reconstruct all sequence fragments given such data. Since this problem is computationally difficut, it is important to develop practically acceptable algorithms.

The simplest case is the single fragment reconstruction problem. There exist an algorithm doing it in linear time. This algorithm can also be modified for the situation of random positive noise.

In the general multiple fragments case the SNSH data allow in realistic situations to determine k-mer subsets constituting vocabularies of individual fragments. Thus the problem reduces to the previous one, where the false positive noise is the result of possible occurrences of the same pairs of k-mers in other fragments of the pool in inverse order.

The number of hybridization events, or array capacity, in SNSH with a fixed k is 4^2k, as in SBH with 2k-mers. However, computational experiments demonstrate that the resolving power of SNSH with k-mers is much higher than that of SBH with 2k-mers, especially in the case of natural sequences where repeats are abundant.

There exist different combinations of these basic ideas. In particular, SBH with 2k-mers can be complemented by SNSH with k-mers in order to derive information about 2k-mer precedence. This can lead to spurious pairs of 2k-mers, and thus the problem again reduces to the single fragment reconstruction given 2k-mer data with false positive noise. This approach allows to increase the length of reconstructed fragments at least an order of magnitude as compared to SBH alone.

Another possibility is to reconstruct a relatively long fragment in two rounds of SNSH: reconstruction of the fragment itself, resulting in multiple local ambiguities, and reconstruction of a pool of shorter fragments produced by restriction of the initial fragment, which eliminates most local ambiguities (except the trivial ones).

The reconstruction algorithms are described in the accompanying poster by Rubinov and Gelfand. They have been implemented and extensive computer simulations are performed, aimed at analysis of feasibility of the method, determination of the best experimental parameters, and comparison with traditional SBH. The results of these simulations are described in the poster by Razgulyaev et al.

This work was supported in part by DOE grant OR00033-93CIS016 and ISF grant MTWOOO.

Program

DIMACS Homepage

Contacting the Center
Document last modified on March 28, 2000.