DIMACS 1994-00 Special Year on Molecular Biology
The title of the special year, ``Mathematical Support for
Molecular Biology,'' is certainly too broad to give a specific idea of
the focus of the year. Indeed, virtually every area of molecular
biology has some inherently mathematical problems which need solving,
and many areas of computer science and mathematics have something to
contribute to the biological sciences. Clearly, not all topics which
might constitute ``mathematical support for molecular biology'' can be
covered in one ``special year'' nor is it appropriate for DIMACS to
attempt to address all such issues. DIMACS has a specific focus,
associated with discrete mathematics and theoretical computer science.
Discrete mathematics is concerned with designs, patterns, sequences,
strings, arrays, ... It is concerned with the existence of these
discrete structures, with their identification and characterization,
and with optimization problems associated with them. Discrete
mathematicians interact with theoretical computer scientists, and both
communities feed off of each other in their efforts to develop
algorithms for solving problems concerned with discrete structures.
The tools of discrete mathematics and theoretical computer science
involve such areas of mathematics as combinatorics, graph theory, and
probability theory.
Discrete structures have been used to formulate some of the
most fundamental concepts of molecular biology. It is now well-known
that information storage within a cell is by means of long nucleic
molecules which can be thought of as long strings of smaller units
called nucleotides. Some of the most important problems of modern
molecular biology involve combinatorial and algorithmic questions
about these strings. Models of proteins, DNA, and RNA molecules
as sequences form a basic tool of molecular biology.
Throughout its history, molecular biology has interacted with
methods of discrete mathematics and theoretical computer science. The
first nucleic acid sequence was determined in 1965 by R.W. Holley and
his co-workers at Cornell University (it contained 77 bases). The
method used was the fragmentation stratagem, which involved the
combinatorial problem of reconstructing an unknown sequence from
fragments obtained by certain enzyme decompositions.
Graph-theoretical methods, using eulerian chains, were developed by
Hutchinson to give algorithms for implementing the fragmentation
stratagem. Of course, these methods are not used any more, and indeed
were used only for a short time before other, more efficient methods
were adopted. However, they played an important role in the
development of molecular biology.
Overlap data arising from the study of mutations led Cal Tech
geneticist Seymour Benzer in the late 1950's to pose a
graph-theoretical question that led to the study of interval graphs.
Characterizations of interval graphs led Benzer to conclude that his
overlap data was consistent with the hypothesis of linear gene
structure, and played a basic role in establishing the idea that DNA
or RNA molecules can be viewed as linear words over a 4-letter
alphabet. Interval graphs continue to be important today in
connection with the study of restriction maps of DNA chains. Their
generalizations to higher dimensions, in particular the 2-unit sphere
graphs, arise in biochemistry in problems of macro-molecular
conformation.
Because much of modern molecular biology is concerned with
information transmission, it should be no surprise that the
mathematical methods that have been most useful in the development of
information science are perhaps the most useful in the development of
molecular biology. It is no accident that DIMACS was founded as a
consortium whose two industrial members, AT Bell Laboratories and
Bellcore, are concerned with information transmission.
The fundamental premise of our special year is that some of
the most central problems in molecular biology are essentially
problems involving the combinatorial and algorithmic questions that
DIMACS-type researchers are good at solving. It is a basic
scientific objective of the year to create partnerships between
biological scientists and discrete mathematicians/theoretical computer
scientists so that some very important questions of molecular biology
can be properly and precisely formulated as mathematical problems. It
is the hope of the special year organizers that, by bringing together
some of the world's leading discrete mathematicians/theoretical
computer scientists, some major progress can be made on these
problems. Inevitably, some of this ``progress'' will be purely of a
mathematical nature. However, it is our hope that we can establish
and nurture lines of communication and collaboration so that the
mathematical results that arise from the special year will then be
modified and revised so that many of them will be quickly communicated
to biological scientists and so that biological data and biological
questions will continue to play a central role in refinements of the
mathematics.
In planning special years, DIMACS has had a tradition of
maintaining flexibility so that as the special year progresses, the
scientific focus can change. Still, it is necessary to set an agenda
before the year begins, and we try to do that by planning some major
activities that give the year focus. In this special year, there are
two major activities of this kind, the workshops and the algorithm
implementation challenge. We have chosen the workshop topics in areas
with a proven history of biological relevance and discrete mathematical
structure. We will use a series of mini-workshops to be more
exploratory, leaving many of them until later to be planned.
We are planning five main workshops of three to five days
duration. The fifth will be the culmination of our algorithm
implementation challenge, and is a bit different from the others.
Generically speaking, the goals for each of our other four main
workshops are similar: Get biologists and mathematical scientists
together to try to establish long-term collaborations, have them work
together to identify mathematical and algorithmic problems that are of
interest to biology, and get mathematical scientists with existing or
potential interest in those problems together to work on them,
together with biologists, both at the workshop and afterwards. In
order to achieve these goals, we have involved both mathematical
scientists and biological scientists as chief organizers or principal
advisors of each of the main workshops and the organizing committees
for these workshops will consist of both leading mathematical
scientists and leading biological scientists. The organizing
committees will be or have been in on the planning of the workshops
and their members will be participating as speakers and discussants.
The first three workshops have all been chosen to reflect
biological areas with existing important underlying combinatorial
themes.
- First Workshop: Combinatorial Methods for Mapping and Sequencing DNA
Mapping the human genome would require localizing each of its
genes and sequencing it would require determining the exact order of
the nucleotides making up each gene. Even the intermediate goal of
mapping the genome presents significant computational challenges. The
workshop will be concerned with algorithmic problems arising from
mapping and sequencing, for example problems involving fragment
assembly, particularly in the presence of noisy data, and approximate
string matching. We expect to have sessions on the double digest and
partial digest problems, large-scale DNA physical mapping, DNA
sequence assembly, DNA sequencing by hybridization and the shortest
superstring problem, new approaches to DNA mapping and sequences,
computational support of large sequencing projects (databases,
sequence accuracy, automation), and open problems in DNA mapping and
sequencing. The topics of the workshop were chosen to be closely tied
to the issues being considered in our Algorithm Implementation
Challenge. The workshop will be used to kick off the Challenge and we
will devote considerable time in the workshop to implementation
issues.
- Second Workshop: Sequence Alignment
This workshop will deal with string alignment problems in
which one wishes to gain information about a newly sequenced piece of
DNA by comparing, or aligning it, with a sequence of known function or
structure. Detection of similarity between two different molecular
sequences has led to the discovery of shared phenomena. (We have
already referred to the discovery that the sequence for platelet
derived factor, which causes growth in the body, is 87 identical to
the sequence for v-sis, a cancer-causing gene, which led to the
discovery that v-sis works by stimulating growth.) The quality of a
match between two sequences can be determined by a scoring matrix and
a charge for introducing gaps in one of the sequences to get a better
match, and then a dynamic programming algorithm can be used to
determine the largest number of places where two sequences (gaps
added) agree. The theory of random graphs can be used to compare two
random sequences and predict how good a match one can expect. Work on
determining similarity between pairs of sequences can be expanded to
work on detecting matches among a whole cluster of such sequences, and
then algorithms or heuristics for determining clique-like structures
in corresponding graphs can be useful for finding patterns. In this
workshop, we will explore all of these ideas. We will investigate
dot-matrix methods, global alignments, local alignments and hash
coding methods, multiple alignments, measures of aminoacid similarity,
and statistical significance of alignments. This workshop will also
be closely coordinated with our Algorithm Implementation Challenge.
- Third Workshop: Phylogeny
The study of evolution has become rather more quantitative
since the recombinant DNA revolution. The phylogeny of a set of
species is a model of the evolutionary relationship of that set.
Almost all such models in use can be described as trees or tree-like
structures that are among the most basic structures of discrete
mathematics. The study of phylogenetic tree construction is an
important area of biology, with significant statistical and
algorithmic components. These two facets play off each other since
one is often forced into a tradeoff between the statistically relevant
and the computationally tractable. The workshop will be concerned
with combinatorial, algorithmic, and statistical aspects of
phylogenetic tree construction. We will be interested in finding
parsimonious phylogenies (using either Steiner minimal trees or
minimal spanning trees), with searches through phylogenetic solution
space, with modeling operational taxonomic units as either nodes of
trees or both nodes and branches, with cophenetic correlation measures
of how well a particular tree matches raw data, and with discrete
Fourier methods. We will investigate tree reconstruction based on
genome orders, speciation, fossil record and inference about
divergence times, non-tree structures like those proposed by Dress and
others, and the effects of recombination. We will discuss the methods
vthat are available for tree reconstruction with many different sorts
of molecular data all at the same time -- a very important practical
issue. It is our goal to address strengths and weaknesses of
construction methods (distance methods, parsimony, statistical (such
as maximum likelihood), and invariants); computational issues
involving software and supercomputing applications; and algorithmic
issues; and to set an agenda for future work in computational
phylogeny.
- Fourth Workshop: HIV Sequence Analysis
Although the generic goals of the fourth workshop are similar
to those of the first three, this fourth workshop is different in
character because it will be devoted to a specific applied topical
area rather than a general one. It is concerned with HIV sequence
data, and will have several days devoted to presentations of
experimental data and a much shorter time devoted to methodological
issues. Studies of HIV sequence variation involve global variation,
local variation across individuals, tissue or cell tropism, variation
of the transmitted virus between sexual partners or mothers and
infants, variation of the quasi-species within an infected individual,
and covariation of different sites along the virus. Current methods
for analyzing the available information are not fully understood.
Combinatorial methods have started to play an important role in the
analysis and modeling of HIV sequence data. Among the combinatorial
issues involved here are the theory of random trees, the development
of measures of variability for trees, the study of measures of
distance between trees, and the study of statistical confidence in a
constructed phylogeny. Thus, techniques investigated in our first
three workshops will have some bearing on the subject matter of the
fourth workshop.
- Algorithm Implementation Challenge
Throughout the special year, we will be emphasizing the
development of algorithmic methods for the solution of problems of
molecular biology. One of the central activities will be an Algorithm
Implementation Challenge, in which researchers will be developing
algorithmic methods for a series of benchmark problems developed in
conjunction with molecular biologists. All of the specifics of the
Challenge have not yet been worked out. However, we have decided that
at least part if not all of it will be devoted to problems of DNA
sequence determination from shotgun sequence data. There are several
features to the shotgun sequencing problem that make it appropriate
for the special year. Most important is that it is very relevant to
molecular biology and the genome projects. Scores of groups around
the world are working on this problem, from both theoretical and
applied points of view. Shotgun sequence assembly encompasses aspects
of sequence comparison and multiple sequence alignment and is closely
related to the first two workshops on Mapping and Sequencing and on
Sequence Alignment. Issues of implementation will be stressed at
these two workshops, which will be used to lay the groundwork for the
Challenge. The goal will be to involve a wide variety of computer
scientists in the development of algorithms for carefully defined
problems developed between mathematical scientists and biological
scientists. Intermediate feedback to algorithm developers will be
provided during the course of the year, with comments about both
computational aspects and biological relevance.
- Fifth Workshop: Algorithm Implementation Challenge Workshop
The ``Challenge'' will culminate in a workshop in September of
the second year of our project. Participants will describe their
algorithmic approaches to the challenge problems, implementation
issues which arose will be debated and discussed, and new directions
and further requirements for improvement will be developed.
While our workshops and algorithm implementation challenge
will provide the main scientific focus for the year, a series of
one-day mini-workshops will give us the opportunity to explore a
variety of additional topics. We do not intend to specify in advance
all of the mini-workshop topics. Indeed, we hope that during the
course of the planning for the special year and even during the year,
we will find that some topics are of special importance and merit a
mini-workshop. Mini-workshops can be used to explore topics that
might not be as clearly well advanced in the field of computational
biology or math/cs topics that are of potential but not proven
significance for molecular biology. Also, they can be used to explore
molecular biological topics where the payoff to molecular biology of
math/cs methods may not be so clear.
- Mini-workshop on Combinatorial Structures in Molecular Biology
We are planning on two mini-workshops to be organized by
mathematical scientists and to center around mathematical methods.
One is on combinatorial structures in molecular biology, and will
simply explore a variety of combinatorial questions that relate to or
come from molecular biology. Although biologists will be involved in
the organization of this mini, it is mostly directed at enticing
mathematicians to get involved with the interesting mathematical
problems that arise.
- Mini-workshop on Antibody Sequence and Structure
A second mini organized around mathematical topics will
emphasize combinatorial problem areas in the study of antibody
sequence and structure. This mini will be different from the other
mathematical one in its emphasis on problems arising from a more
narrow area of molecular biological application.
- Mini-workshop on Gene Finding and Gene Prediction
A third mini that is already planned emphasizes methods for
gene finding and gene prediction. It will be concerned with the
increasingly important activity in computational biology of
discovering protein-encoding genes in otherwise uncharacterized
primary sequence data. This has traditionally been done by
discriminating likely coding regions based on a variety of statistical
analyses and to a lesser extent by detection of landmark sequences
such as splice junctions. The methods to be explored will center
around methods of artificial intelligence such as syntactic pattern
recognition (which has recently been useful in the problem of assembly
of exonic regions into plausible genes) and neural nets. There will
also be considerable attention paid to methods of information theory,
signal processing, and dynamic programming. This mini will explore
methods that are perhaps somewhat peripheral to those described in our
summary of the major workshops, but are of increasing interest to
those in the computational biology community. We will discuss the
most recent efforts to discriminate compositional and signal cues from
sequence data and to convey them to integrators, and we will be
comparing frameworks and algorithms for gene assembly and combination
of evidence.
- Mini-workshop on DNA Topology and Regulation
We have planned a mini-workshop on DNA topology and
regulation. DNA in living systems is held by other molecules in a
sequence of loops, within each of which a topological constraint is
imposed. This constraint is the constancy of the linking number, the
number of times either strand of the DNA double helix links through
the loop formed by the other strand. Enzymes regulate the imposed
linking number by actions which involve the transient cutting of one
or both strands, followed by strand passage or rotation and reclosure.
Changes of linking number modulate virtually all the important
biological activities of DNA, including the initiation of gene
expression and of DNA replication. Understanding this phenomenon in
detail is an essential part of elucidating the content and the
functions of the genomic message. The workshop will focus on
topologically constrained DNA and study topological and geometric
properties of DNA loops and computational strategies to analyze and
predict the response of a DNA loop to alterations of its linking
number. Mathematical methods to be explored will include Monte Carlo
techniques and calculation of partition functions, methods somewhat
different than those emphasized in our main workshops, but yet which
are becoming increasingly important.
- Mini-workshop on Global Minimization of Nonconvex Energy Functions
In addition to the Mini-Workshop on Antibody Sequence and
Structure, we are planning several other mini-workshops in the
general areas of protein structural analysis and prediction. While
these have been a major area of computational biology for years, most
approaches to them have used numerical analysis rather than
combinatorial methods. These numerical methods have been applied to
models of the protein at the atomic level with an attempt to optimize
for the minimum energy state. Even within this traditional approach,
some optimization methods related to those of discrete mathematics
have been finding increasing application. Since in almost all cases
the potential energy function is nonconvex and therefore has many
local minimizers, the minimization of the potential energy function is
a very hard problem. For many years, this has been a field of
chemists and physicists, but in recent years, many researchers from
optimization and computer science have paid close attention since
potential energy minimization problems are ideal as test problems for
global minimization algorithms. A mini-workshop on the topic of
global minimization of nonconvex energy functions will give us an
opportunity to create a dialogue between the theoretical computer
scientists developing global minimization algorithms and the
structural biologists (some of whom are in the Rutgers Chemistry
Department) who are interested in protein folding.
- Mini-workshops on Geometrical Methods for Conformational Modeling
and on Sequence-Based Methods for Protein Folding
In the last several years, the focus on protein structure and
folding has begun to shift from physical modeling to pattern matching.
What has happened is that, against initial expectation, the structure
of enough proteins has been learned that there are definite patterns
that are observable and that have great predictive value both for
structure and function. For instance, recently, some researchers have
started exploring some subtasks in protein structure prediction from a
discrete algorithmic point of view. Investigators have had some
initial success on some problems of protein geometry, e.g.,
three-dimensional surface matching, and in predicting or detecting
structure from sequence data, e.g., motif recognition from sequence
data. Thus, the study of protein structure is becoming much more the
study of sequence and critical conserved features from sequence, and
that makes this topic very much a topic where discrete mathematicians
can make contributions. These ideas are far from mature and their
long-term relevance is unclear. Still, a mini-workshop or two will
give us an inexpensive way to explore the area and interchange ideas.
- Possible Mini-workshop on Database Aspects of Biological Data
As we have observed earlier, the existence of libraries of DNA
and protein sequences allows researchers to attempt a variety of
important tasks such as seeking genes that may be similar to a newly
determined sequence. There are many potentially complicated and
daunting problems raised by biological databases. Access problems,
error correction and detection problems, and file management all present
issues of traditional importance in theoretical computer science. We
are exploring the possibility of a mini-workshop on the topic of
database aspects of biological data.
- Possible Mini-workshop on RNA Structure
The three dimensional conformation assumed by RNA is crucial
to the functioning of the cell. In contrast with proteins, RNA bases
pair up to form double stranded helixes, and so much of the three
dimensional structure of RNA can be described by the two dimensional
information about base pairings. There has therefore been a long
tradition of attacking this problem via discrete algorithmic
approaches, since the secondary structure is discrete -- a list of
paired bases -- rather than continuous -- real coordinates of specific
molecules. We are considering a mini-workshop that would explore the
traditional dynamic programming approaches to this problem, as well as
looking at novel algorithmic approaches.
We hope that this discussion has provided an overview of the
scientific topics upon which the special year will concentrate.
Previous DIMACS special years have shown that when we mix different
communities of researchers, we don't know exactly where we will end
up. However, we hope that we have organized the year, given it
sufficient focus, and involved both mathematical and biological
communities in its planning, so that we have a good chance of
attaining the kinds of successes earlier DIMACS special years have attained.
Return to Special Year in Mathematical Support for Molecular Biology Home Page
Return to DIMACS Home Page
Contacting the Center
Document last modified on February 2, 2000.