DIMACS 1994-00 Special Year on Molecular Biology

The title of the special year, ``Mathematical Support for Molecular Biology,'' is certainly too broad to give a specific idea of the focus of the year. Indeed, virtually every area of molecular biology has some inherently mathematical problems which need solving, and many areas of computer science and mathematics have something to contribute to the biological sciences. Clearly, not all topics which might constitute ``mathematical support for molecular biology'' can be covered in one ``special year'' nor is it appropriate for DIMACS to attempt to address all such issues. DIMACS has a specific focus, associated with discrete mathematics and theoretical computer science. Discrete mathematics is concerned with designs, patterns, sequences, strings, arrays, ... It is concerned with the existence of these discrete structures, with their identification and characterization, and with optimization problems associated with them. Discrete mathematicians interact with theoretical computer scientists, and both communities feed off of each other in their efforts to develop algorithms for solving problems concerned with discrete structures. The tools of discrete mathematics and theoretical computer science involve such areas of mathematics as combinatorics, graph theory, and probability theory.

Discrete structures have been used to formulate some of the most fundamental concepts of molecular biology. It is now well-known that information storage within a cell is by means of long nucleic molecules which can be thought of as long strings of smaller units called nucleotides. Some of the most important problems of modern molecular biology involve combinatorial and algorithmic questions about these strings. Models of proteins, DNA, and RNA molecules as sequences form a basic tool of molecular biology.

Throughout its history, molecular biology has interacted with methods of discrete mathematics and theoretical computer science. The first nucleic acid sequence was determined in 1965 by R.W. Holley and his co-workers at Cornell University (it contained 77 bases). The method used was the fragmentation stratagem, which involved the combinatorial problem of reconstructing an unknown sequence from fragments obtained by certain enzyme decompositions. Graph-theoretical methods, using eulerian chains, were developed by Hutchinson to give algorithms for implementing the fragmentation stratagem. Of course, these methods are not used any more, and indeed were used only for a short time before other, more efficient methods were adopted. However, they played an important role in the development of molecular biology.

Overlap data arising from the study of mutations led Cal Tech geneticist Seymour Benzer in the late 1950's to pose a graph-theoretical question that led to the study of interval graphs. Characterizations of interval graphs led Benzer to conclude that his overlap data was consistent with the hypothesis of linear gene structure, and played a basic role in establishing the idea that DNA or RNA molecules can be viewed as linear words over a 4-letter alphabet. Interval graphs continue to be important today in connection with the study of restriction maps of DNA chains. Their generalizations to higher dimensions, in particular the 2-unit sphere graphs, arise in biochemistry in problems of macro-molecular conformation.

Because much of modern molecular biology is concerned with information transmission, it should be no surprise that the mathematical methods that have been most useful in the development of information science are perhaps the most useful in the development of molecular biology. It is no accident that DIMACS was founded as a consortium whose two industrial members, AT Bell Laboratories and Bellcore, are concerned with information transmission.

The fundamental premise of our special year is that some of the most central problems in molecular biology are essentially problems involving the combinatorial and algorithmic questions that DIMACS-type researchers are good at solving. It is a basic scientific objective of the year to create partnerships between biological scientists and discrete mathematicians/theoretical computer scientists so that some very important questions of molecular biology can be properly and precisely formulated as mathematical problems. It is the hope of the special year organizers that, by bringing together some of the world's leading discrete mathematicians/theoretical computer scientists, some major progress can be made on these problems. Inevitably, some of this ``progress'' will be purely of a mathematical nature. However, it is our hope that we can establish and nurture lines of communication and collaboration so that the mathematical results that arise from the special year will then be modified and revised so that many of them will be quickly communicated to biological scientists and so that biological data and biological questions will continue to play a central role in refinements of the mathematics.

In planning special years, DIMACS has had a tradition of maintaining flexibility so that as the special year progresses, the scientific focus can change. Still, it is necessary to set an agenda before the year begins, and we try to do that by planning some major activities that give the year focus. In this special year, there are two major activities of this kind, the workshops and the algorithm implementation challenge. We have chosen the workshop topics in areas with a proven history of biological relevance and discrete mathematical structure. We will use a series of mini-workshops to be more exploratory, leaving many of them until later to be planned.

We are planning five main workshops of three to five days duration. The fifth will be the culmination of our algorithm implementation challenge, and is a bit different from the others. Generically speaking, the goals for each of our other four main workshops are similar: Get biologists and mathematical scientists together to try to establish long-term collaborations, have them work together to identify mathematical and algorithmic problems that are of interest to biology, and get mathematical scientists with existing or potential interest in those problems together to work on them, together with biologists, both at the workshop and afterwards. In order to achieve these goals, we have involved both mathematical scientists and biological scientists as chief organizers or principal advisors of each of the main workshops and the organizing committees for these workshops will consist of both leading mathematical scientists and leading biological scientists. The organizing committees will be or have been in on the planning of the workshops and their members will be participating as speakers and discussants.

The first three workshops have all been chosen to reflect biological areas with existing important underlying combinatorial themes.

First Workshop: Combinatorial Methods for Mapping and Sequencing DNA Mapping the human genome would require localizing each of its genes and sequencing it would require determining the exact order of the nucleotides making up each gene. Even the intermediate goal of mapping the genome presents significant computational challenges. The workshop will be concerned with algorithmic problems arising from mapping and sequencing, for example problems involving fragment assembly, particularly in the presence of noisy data, and approximate string matching. We expect to have sessions on the double digest and partial digest problems, large-scale DNA physical mapping, DNA sequence assembly, DNA sequencing by hybridization and the shortest superstring problem, new approaches to DNA mapping and sequences, computational support of large sequencing projects (databases, sequence accuracy, automation), and open problems in DNA mapping and sequencing. The topics of the workshop were chosen to be closely tied to the issues being considered in our Algorithm Implementation Challenge. The workshop will be used to kick off the Challenge and we will devote considerable time in the workshop to implementation issues.
Second Workshop: Sequence Alignment This workshop will deal with string alignment problems in which one wishes to gain information about a newly sequenced piece of DNA by comparing, or aligning it, with a sequence of known function or structure. Detection of similarity between two different molecular sequences has led to the discovery of shared phenomena. (We have already referred to the discovery that the sequence for platelet derived factor, which causes growth in the body, is 87 identical to the sequence for v-sis, a cancer-causing gene, which led to the discovery that v-sis works by stimulating growth.) The quality of a match between two sequences can be determined by a scoring matrix and a charge for introducing gaps in one of the sequences to get a better match, and then a dynamic programming algorithm can be used to determine the largest number of places where two sequences (gaps added) agree. The theory of random graphs can be used to compare two random sequences and predict how good a match one can expect. Work on determining similarity between pairs of sequences can be expanded to work on detecting matches among a whole cluster of such sequences, and then algorithms or heuristics for determining clique-like structures in corresponding graphs can be useful for finding patterns. In this workshop, we will explore all of these ideas. We will investigate dot-matrix methods, global alignments, local alignments and hash coding methods, multiple alignments, measures of aminoacid similarity, and statistical significance of alignments. This workshop will also be closely coordinated with our Algorithm Implementation Challenge.
Third Workshop: Phylogeny The study of evolution has become rather more quantitative since the recombinant DNA revolution. The phylogeny of a set of species is a model of the evolutionary relationship of that set. Almost all such models in use can be described as trees or tree-like structures that are among the most basic structures of discrete mathematics. The study of phylogenetic tree construction is an important area of biology, with significant statistical and algorithmic components. These two facets play off each other since one is often forced into a tradeoff between the statistically relevant and the computationally tractable. The workshop will be concerned with combinatorial, algorithmic, and statistical aspects of phylogenetic tree construction. We will be interested in finding parsimonious phylogenies (using either Steiner minimal trees or minimal spanning trees), with searches through phylogenetic solution space, with modeling operational taxonomic units as either nodes of trees or both nodes and branches, with cophenetic correlation measures of how well a particular tree matches raw data, and with discrete Fourier methods. We will investigate tree reconstruction based on genome orders, speciation, fossil record and inference about divergence times, non-tree structures like those proposed by Dress and others, and the effects of recombination. We will discuss the methods vthat are available for tree reconstruction with many different sorts of molecular data all at the same time -- a very important practical issue. It is our goal to address strengths and weaknesses of construction methods (distance methods, parsimony, statistical (such as maximum likelihood), and invariants); computational issues involving software and supercomputing applications; and algorithmic issues; and to set an agenda for future work in computational phylogeny.
Fourth Workshop: HIV Sequence Analysis Although the generic goals of the fourth workshop are similar to those of the first three, this fourth workshop is different in character because it will be devoted to a specific applied topical area rather than a general one. It is concerned with HIV sequence data, and will have several days devoted to presentations of experimental data and a much shorter time devoted to methodological issues. Studies of HIV sequence variation involve global variation, local variation across individuals, tissue or cell tropism, variation of the transmitted virus between sexual partners or mothers and infants, variation of the quasi-species within an infected individual, and covariation of different sites along the virus. Current methods for analyzing the available information are not fully understood. Combinatorial methods have started to play an important role in the analysis and modeling of HIV sequence data. Among the combinatorial issues involved here are the theory of random trees, the development of measures of variability for trees, the study of measures of distance between trees, and the study of statistical confidence in a constructed phylogeny. Thus, techniques investigated in our first three workshops will have some bearing on the subject matter of the fourth workshop.
Algorithm Implementation Challenge Throughout the special year, we will be emphasizing the development of algorithmic methods for the solution of problems of molecular biology. One of the central activities will be an Algorithm Implementation Challenge, in which researchers will be developing algorithmic methods for a series of benchmark problems developed in conjunction with molecular biologists. All of the specifics of the Challenge have not yet been worked out. However, we have decided that at least part if not all of it will be devoted to problems of DNA sequence determination from shotgun sequence data. There are several features to the shotgun sequencing problem that make it appropriate for the special year. Most important is that it is very relevant to molecular biology and the genome projects. Scores of groups around the world are working on this problem, from both theoretical and applied points of view. Shotgun sequence assembly encompasses aspects of sequence comparison and multiple sequence alignment and is closely related to the first two workshops on Mapping and Sequencing and on Sequence Alignment. Issues of implementation will be stressed at these two workshops, which will be used to lay the groundwork for the Challenge. The goal will be to involve a wide variety of computer scientists in the development of algorithms for carefully defined problems developed between mathematical scientists and biological scientists. Intermediate feedback to algorithm developers will be provided during the course of the year, with comments about both computational aspects and biological relevance.
Fifth Workshop: Algorithm Implementation Challenge Workshop The ``Challenge'' will culminate in a workshop in September of the second year of our project. Participants will describe their algorithmic approaches to the challenge problems, implementation issues which arose will be debated and discussed, and new directions and further requirements for improvement will be developed.
While our workshops and algorithm implementation challenge will provide the main scientific focus for the year, a series of one-day mini-workshops will give us the opportunity to explore a variety of additional topics. We do not intend to specify in advance all of the mini-workshop topics. Indeed, we hope that during the course of the planning for the special year and even during the year, we will find that some topics are of special importance and merit a mini-workshop. Mini-workshops can be used to explore topics that might not be as clearly well advanced in the field of computational biology or math/cs topics that are of potential but not proven significance for molecular biology. Also, they can be used to explore molecular biological topics where the payoff to molecular biology of math/cs methods may not be so clear.
Mini-workshop on Combinatorial Structures in Molecular Biology We are planning on two mini-workshops to be organized by mathematical scientists and to center around mathematical methods. One is on combinatorial structures in molecular biology, and will simply explore a variety of combinatorial questions that relate to or come from molecular biology. Although biologists will be involved in the organization of this mini, it is mostly directed at enticing mathematicians to get involved with the interesting mathematical problems that arise.
Mini-workshop on Antibody Sequence and Structure A second mini organized around mathematical topics will emphasize combinatorial problem areas in the study of antibody sequence and structure. This mini will be different from the other mathematical one in its emphasis on problems arising from a more narrow area of molecular biological application.
Mini-workshop on Gene Finding and Gene Prediction A third mini that is already planned emphasizes methods for gene finding and gene prediction. It will be concerned with the increasingly important activity in computational biology of discovering protein-encoding genes in otherwise uncharacterized primary sequence data. This has traditionally been done by discriminating likely coding regions based on a variety of statistical analyses and to a lesser extent by detection of landmark sequences such as splice junctions. The methods to be explored will center around methods of artificial intelligence such as syntactic pattern recognition (which has recently been useful in the problem of assembly of exonic regions into plausible genes) and neural nets. There will also be considerable attention paid to methods of information theory, signal processing, and dynamic programming. This mini will explore methods that are perhaps somewhat peripheral to those described in our summary of the major workshops, but are of increasing interest to those in the computational biology community. We will discuss the most recent efforts to discriminate compositional and signal cues from sequence data and to convey them to integrators, and we will be comparing frameworks and algorithms for gene assembly and combination of evidence.
Mini-workshop on DNA Topology and Regulation We have planned a mini-workshop on DNA topology and regulation. DNA in living systems is held by other molecules in a sequence of loops, within each of which a topological constraint is imposed. This constraint is the constancy of the linking number, the number of times either strand of the DNA double helix links through the loop formed by the other strand. Enzymes regulate the imposed linking number by actions which involve the transient cutting of one or both strands, followed by strand passage or rotation and reclosure. Changes of linking number modulate virtually all the important biological activities of DNA, including the initiation of gene expression and of DNA replication. Understanding this phenomenon in detail is an essential part of elucidating the content and the functions of the genomic message. The workshop will focus on topologically constrained DNA and study topological and geometric properties of DNA loops and computational strategies to analyze and predict the response of a DNA loop to alterations of its linking number. Mathematical methods to be explored will include Monte Carlo techniques and calculation of partition functions, methods somewhat different than those emphasized in our main workshops, but yet which are becoming increasingly important.
Mini-workshop on Global Minimization of Nonconvex Energy Functions In addition to the Mini-Workshop on Antibody Sequence and Structure, we are planning several other mini-workshops in the general areas of protein structural analysis and prediction. While these have been a major area of computational biology for years, most approaches to them have used numerical analysis rather than combinatorial methods. These numerical methods have been applied to models of the protein at the atomic level with an attempt to optimize for the minimum energy state. Even within this traditional approach, some optimization methods related to those of discrete mathematics have been finding increasing application. Since in almost all cases the potential energy function is nonconvex and therefore has many local minimizers, the minimization of the potential energy function is a very hard problem. For many years, this has been a field of chemists and physicists, but in recent years, many researchers from optimization and computer science have paid close attention since potential energy minimization problems are ideal as test problems for global minimization algorithms. A mini-workshop on the topic of global minimization of nonconvex energy functions will give us an opportunity to create a dialogue between the theoretical computer scientists developing global minimization algorithms and the structural biologists (some of whom are in the Rutgers Chemistry Department) who are interested in protein folding.
Mini-workshops on Geometrical Methods for Conformational Modeling and on Sequence-Based Methods for Protein Folding In the last several years, the focus on protein structure and folding has begun to shift from physical modeling to pattern matching. What has happened is that, against initial expectation, the structure of enough proteins has been learned that there are definite patterns that are observable and that have great predictive value both for structure and function. For instance, recently, some researchers have started exploring some subtasks in protein structure prediction from a discrete algorithmic point of view. Investigators have had some initial success on some problems of protein geometry, e.g., three-dimensional surface matching, and in predicting or detecting structure from sequence data, e.g., motif recognition from sequence data. Thus, the study of protein structure is becoming much more the study of sequence and critical conserved features from sequence, and that makes this topic very much a topic where discrete mathematicians can make contributions. These ideas are far from mature and their long-term relevance is unclear. Still, a mini-workshop or two will give us an inexpensive way to explore the area and interchange ideas.
Possible Mini-workshop on Database Aspects of Biological Data As we have observed earlier, the existence of libraries of DNA and protein sequences allows researchers to attempt a variety of important tasks such as seeking genes that may be similar to a newly determined sequence. There are many potentially complicated and daunting problems raised by biological databases. Access problems, error correction and detection problems, and file management all present issues of traditional importance in theoretical computer science. We are exploring the possibility of a mini-workshop on the topic of database aspects of biological data.
Possible Mini-workshop on RNA Structure The three dimensional conformation assumed by RNA is crucial to the functioning of the cell. In contrast with proteins, RNA bases pair up to form double stranded helixes, and so much of the three dimensional structure of RNA can be described by the two dimensional information about base pairings. There has therefore been a long tradition of attacking this problem via discrete algorithmic approaches, since the secondary structure is discrete -- a list of paired bases -- rather than continuous -- real coordinates of specific molecules. We are considering a mini-workshop that would explore the traditional dynamic programming approaches to this problem, as well as looking at novel algorithmic approaches.
We hope that this discussion has provided an overview of the scientific topics upon which the special year will concentrate. Previous DIMACS special years have shown that when we mix different communities of researchers, we don't know exactly where we will end up. However, we hope that we have organized the year, given it sufficient focus, and involved both mathematical and biological communities in its planning, so that we have a good chance of attaining the kinds of successes earlier DIMACS special years have attained.

Return to Special Year in Mathematical Support for Molecular Biology Home Page

Return to DIMACS Home Page

Contacting the Center
Document last modified on February 2, 2000.