Genome Map Assembly using Noisy Data

Anthony Bonner and Eric Harley

University of Toronto, Department of Computer Science, Toronto, Canada M5S 1AI

A common step in the analysis of a genome is the generation of a physical map, i.e., an ordering of overlapping fragments of the genome and a specification of the binding sites of certain probes. An important data structure for this problem is a graph in which each node represents a DNA fragment. Edges represent experimental or derived relationships between the fragments, such as the overlap of two fragments. For data derived from ideal experiments, interval graphs are particularly important, since they capture the linear structure of a chromosome. Many problems that are NP-complete for general graphs, have efficient algorithms for interval graphs. Unfortunately, interval graphs rarely arise in practice because many non-linearities are introduced by experimental error and noise in the data. Nevertheless, the graphs are not arbitrary, and they are derived from a linear structure (a chromosome). The challenge is to understand the sources of non-linearity and to extract a linear genome map.

To address this problem, we are developing a system that attacks map assembly from three angles: logic programming, data visualization and graph theory. A key idea in our approach is that the layout and display of graphs can ameliorate many of the problems caused by problematic data. Our implementation platform is Hy+, a data visualization system developed at the University of Toronto. Hy+ integrates a logic- programming backend with numerous graph layout algorithms. We have tested the system on real and simulated mapping data provided by the Whitehead Institute/MIT Center for Genome Research. Because the initial results are encouraging, Whitehead has installed the system in their laboratory for further evaluation.

The data we are using contains two main sources of experimental error: chimerism and false negatives. Informally, chimerism leads to graphs that are a collection of "linear" subgraphs tied together by "weak" edges. It is not difficult to identify the weak edges and to break up the graph into many connected components, each having a linear structure and representing a contiguous portion of DNA (a contig). Unfortunately, because of the high rate of false negatives in the data (up to 40%), these components are not interval graphs. False negatives also lead to a high degree of ambiguity, so that many different maps may be consistent with the experimental data, even for a single contig. Situations easily arise in which n data points have 2ⁿ possible maps.

These problems lead to several computational challenges: (i) to succinctly represent the large number of maps that are consistent with the data; (ii) to efficienty answer queries about these maps; and (iii) to develop algorithms and heuristics for exploiting the underlying linear structure of the graphs, especially algorithms for display and layout. Robust algorithms for interval graphs should be useful since although our graphs are not interval graphs, they are "approximately" interval and "locally" interval. Formalizing these notions, and the notion of "linear structure", is an important open problem.

Program

DIMACS Homepage

Contacting the Center
Document last modified on March 28, 2000.