DIMACS Workshop on Integration of Diverse Biological Data

June 21-22, 2001
DIMACS Center, Rutgers University

Andrea Califano (co-chair), First Genetic Trust, acalifano@firstgenetic.net
Conrad Gilliam (co-chair), Columbia University, tcg1@columbia.edu
Fred S. Roberts, Rutgers University, froberts@dimacs.rutgers.edu
Presented under the auspices of the Special Focus on Computational Molecular Biology.



Multidimensional Scaling of Massive Data Sets

Dimitris K. Agrafiotis, 3-Dimensional Pharmaceuticals, Inc.

Multidimensional scaling (MDS) is a collection of statistical techniques
that attempt to embed a set of patterns described by means of a
dissimilarity matrix into a low-dimensional display plane in a way that
preserves their original pairwise relationships as closely as possible.
Unfortunately, current MDS algorithms are notoriously slow, and their use
is limited to small data sets. In this paper, we present a family of
algorithms that combine nonlinear mapping techniques with neural networks,
and make possible the scaling of very large data sets that are intractable
with conventional methodologies. The method employs a nonlinear mapping
algorithm to project a small random sample, and then 'learns' the
underlying transform using one or more multi-layer perceptrons. The
distinct advantage of this approach is that it captures the nonlinear
mapping relationship in an explicit function, and allows the scaling of
additional patterns as they become available, without the need to
reconstruct the entire map. A novel encoding scheme is described, allowing
this methodology to be used with a wide variety of input data
representations and similarity functions. The advantages of this approach
are illustrated using examples from the field of combinatorial chemistry.
It is shown that in the case of combinatorial libraries, it is possible to
predict the coordinates of the products on the nonlinear map from pertinent
features of their respective building blocks, and thus limit the
computationally expensive steps of virtual synthesis and descriptor
generation to only a small fraction of products. In effect, the method
provides an explicit mapping function from reagents to products, and allows
the vast majority of compounds to be projected without constructing their
connection tables.
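The sample-then-learn strategy described in this abstract can be sketched with off-the-shelf components. The descriptors, network sizes, and scikit-learn classes below are illustrative assumptions only, not the authors' actual implementation, which used its own nonlinear mapping algorithm:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))          # hypothetical pattern descriptors

# 1) Project a small random sample with a conventional (slow) MDS embedding.
sample = rng.choice(len(X), size=200, replace=False)
y_sample = MDS(n_components=2, n_init=1, random_state=0).fit_transform(X[sample])

# 2) 'Learn' the underlying transform with a multi-layer perceptron.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
net.fit(X[sample], y_sample)

# 3) Scale all remaining patterns through the explicit learned function;
#    no need to re-run MDS on the full data set.
y_all = net.predict(X)
print(y_all.shape)    # (2000, 2)
```

Because the learned map is an explicit function, new patterns can be projected as they arrive by a single call to `net.predict`, which is the key scaling advantage the abstract claims.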


Cluster Analysis and the Development of a Multi-Genome Database of Tandem Repeats

Gary Benson, The Mount Sinai School of Medicine

This research deals with problems involved in the on-going development of a
multi-genome database of tandem repeats (TRDB). A tandem repeat is an
occurrence of two or more adjacent, often approximate copies of a
sequence of nucleotides. Tandem repeats 1) are known to cause or be associated
with a variety of human diseases, 2) can cause phenotypic variation or
loss of function in proteins, and 3) can modify gene expression, possibly by
adopting unusual structural conformations or by acting as transcription factor
binding sites. Tandem repeats are the primary component of major chromosomal
structures including the centromere, telomere and heterochromatin, and are
involved in chromosome condensation. Because tandem repeats often exhibit copy
number polymorphism, they are useful markers for genetic linkage analysis, DNA
fingerprinting and evolutionary studies.

While tandem repeats form one of the major classes of repeats in genomic DNA,
information about them remains incomplete, fragmented and difficult for
researchers to access. Development of TRDB involves generating primary
information about the repeats and collecting ancillary annotation
information. Central issues of TRDB development are:

  • Clustering of repeats into families. A family is a set of repeats which have similar patterns but occur in different genomic locations or in different genomes. We discuss partition-type clustering based on distance between profile representations of the repeats. A profile is a sequence of discrete distributions. The sequence has length equal to the fundamental pattern size of the repeat, and each distribution represents the frequency of the four nucleotides in a column of the aligned repeat copies. Use of profile representations has necessitated the development of new distance measures for comparing discrete distributions.
  • Functional clustering. Complementary to profile clustering is clustering based on functional properties of the repeats. Functional data are generated by new sequence analysis methods, by accessing existing data sources for annotation features, and by biological experiment. Functional data include, for individual repeats: 1) 'genomic environment': adjacent genes and localization to intron, exon, untranslated or intergenic regions, 2) known or predicted copy number polymorphism, and 3) potential transcription factor binding sites; and for repeat families: 4) internal homogeneity of the family, 5) distribution in the genome, 6) similarity of flanking sequence, and 7) association with protein families.
  • Integrated data visualization and selection (IDVS) tool. To assist in the validation of clustering, functions of this tool include 1) a query system to select repeats/families and associations/properties based on predefined data requests, and 2) the creation of views for repeat, family and genome properties.


Functional Classification of Protein Families by Top-down Clustering from Sequence and Structural Data

Andrea Califano, First Genetic Trust

Given a set of proteins, two important problems in biology are the inference of
biologically and functionally related subsets and the identification of
functional regions and residues. The former is typically performed by
unsupervised, bottom-up clustering, using sequence similarity as a measure of
relatedness. The latter is typically performed as an independent step, starting
from protein sets determined a priori, either manually or computationally.
Semantically, however, the two processes are inextricably linked, since protein
families are usually characterized by corresponding functional regions and
residues. This paper introduces a high-performance, unsupervised clustering
system that accomplishes both tasks simultaneously. Potential functional
regions, inferred using the SPLASH pattern discovery algorithm, are first
filtered using statistical criteria and then used to determine functionally
related protein subsets. To achieve increased accuracy and sensitivity, the
regular expression patterns discovered by the algorithm are converted into more
sensitive and accurate profile Hidden Markov Models (HMMs). This is the first
reported system where potential functional regions are exhaustively and
automatically identified from a set of unrelated sequences. The inference of
functional relationships is performed via a general and flexible model which
integrates both sequence and structural information.

The resulting classification system is organized into structures of varying
complexity, ranging from a tree to an acyclic graph. Since the relationships
correspond to conservation of functional regions, these structures are expected
to be representative of the functional relationships formed throughout the
evolutionary process. To test the system's ability to deal with complex
taxonomies, comparative results on the G-Protein Coupled Receptor (GPCR)
superfamily are reported. This superfamily includes more than 150 functionally
independent subfamilies. As shown in the results, the amino acids that are
highly conserved in the discovered patterns are very likely to correspond to
functional residues. Several hundred functional residues reported in the
literature, based on mutagenesis experiments, have been analyzed in the context
of the reported patterns. This shows that the system can be used as a highly
predictive aid to the planning of this type of experiment.


Modeling Tumors as Complex Dynamic Biosystems

Thomas Deisboeck, Harvard Medical School

There is growing evidence that malignant tumors behave as complex dynamic
biosystems rather than as unorganized cell masses. If this is true, tumors need
to be experimentally studied and ultimately treated as such systems. This
requires the integration of multi-modality data sets. A promising approach is
the cross-disciplinary combination of novel experimental assays and
computational modeling. The talk will describe the concept leading to the
development of such an experimental device as well as its input in computer
visualizations and simulations using cellular automata and agent-based
modeling. Potential clinical applications of this ongoing work will be
discussed.


Functional Characterization in the Post-Genomic Era by Means of Declarative Query Access to Diverse Data and Applications

Barbara A. Eckman, GlaxoSmithKline (joint work with A.S. Kosky, Gene Logic, Inc., and L.A. Laroco, Jr., GlaxoSmithKline)

To perform functional characterization of genomic sequence it is necessary to
integrate data from a variety of locations (within an organization or across
the Internet) in a variety of formats (traditional databases, flat files, web
sites). In addition to simply retrieving data, as in traditional DBMSs, it is
necessary to perform specialized data analysis to discover patterns of
biological interest. Integrating arbitrary analysis with query execution
permits filtering, organizing, and enhancing data retrieved by wide-ranging
multi-database queries, and increases data mining efficiency by enabling
analyses to be performed only on datasets of interest. TINet (Target
Informatics Net) is a readily extensible data integration system developed at
GlaxoSmithKline (GSK), based on the Object-Protocol Model (OPM) multi-database
middleware system of Gene Logic, Inc. Data sources currently integrated
include: the Mouse Genome Database (MGD) and Gene Expression Database (GXD),
GenBank, SwissProt, PROSITE, PubMed, GeneCards, and GSK proprietary relational
and SRS databases. Analytic tools are integrated either as data source servers
(e.g., runtime BLAST and GCG motifs searches) or as special-purpose class
methods (e.g., regular expression pattern-matching over BLAST HSP alignments
and retrieving partial sequences derived from GenBank primary structure
annotations). All data sources and methods are accessible through an SQL-like
query language or a GUI, so that when new investigations arise no additional
programming beyond query specification is required.

The power and flexibility of this approach are illustrated in such integrated
queries as: 1) "Find homologues in genomic sequence to all novel genes cloned
and reported in the scientific literature within the past three months that
are linked to the MeSH term 'neoplasms'"; 2) "Using a neuropeptide precursor
query sequence, return only HSPs where the target genomic sequences conserve
the G?[KR][KR] motif at the appropriate points in the HSP alignment"; and 3)
"Of the human genomic sequences annotated as channels having exon boundaries
in GenBank, return only those with valid putative donor/acceptor sites and
start/stop codons".


Integrative Genomics: Surveys of a Finite Parts List

Mark Gerstein, Yale University (joint work with P. Harrison, J. Qian, V. Alexandrov, P. Bertone, R. Das, D. Greenbaum, R. Jansen, W. Krebs, N. Echols, J. Lin, C. Wilson and A. Drawid)

My talk will focus on analyzing genomes and functional genomics data in terms
of the finite list of protein "parts". I use the term "part" rather broadly;
depending on context, it can be either a protein fold or a family. I will
touch on some of the following topics: (i) how one can compare different
genomes in terms of the occurrence of parts; (ii) how one can do the same
operation on the pseudogenome, the total complement of pseudogenes in an
organism; and (iii) how this idea can be further extended to compare the
representation of parts in the genome versus the transcriptome.

References

P. Harrison, N. Echols, M. Gerstein (2001). "Digging for Dead Genes: An Analysis of the Characteristics of the Pseudogene Population in the C. elegans Genome." Nucleic Acids Res. (in press).
A. Drawid, R. Jansen, M. Gerstein (2000). "Genome-wide analysis relating expression level with protein subcellular localization." Trends Genet 16: 426-30.
A. Drawid, M. Gerstein (2000). "A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome." J Mol Biol 301: 1059-75.
J. Lin, M. Gerstein (2000). "Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels." Genome Res 10: 808-18.
R. Jansen, M. Gerstein (2000). "Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins." Nucleic Acids Res 28: 1481-8.
M. Gerstein (1998). "Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census." Proteins 33: 518-34.


Application of Support Vector Machines to Detect an Association between a Disease or Trait and Multiple SNP Variations

MyungHo Kim, Genomics Collaborative, Inc.

After the completion of the human genome sequence was announced, it became
evident that interpretation of DNA sequences is an immediate task to work on.
To understand their function and signals, improvement of present sequence
analysis tools and development of new ones have become necessary. Following
this trend, we attack one of the fundamental questions: which set of SNP
(single nucleotide polymorphism) variations is related to a specific disease
or trait? Since, across the whole DNA sequence, people are known to differ
only at SNP locations, and since the total number of SNPs is less than 5
million, finding an association between SNP variations and a certain disease
or trait is believed to be one of the essential steps not only for genetic
research but also for drug design and discovery. In this paper, we present a
method for detecting whether there is an association between multiple SNP
variations and a trait or disease. The basic scheme is as follows:
  • Assume that there is no environmental factor.
  • Suggest a vector representation of multiple SNP variations.
  • Apply the Support Vector Machine, which has been attracting a great deal of attention recently.


A Bayesian Framework for Combining Gene Predictions

Pedro Moreno, Cambridge Research Laboratory (joint work with V. Pavlovic, A. Garg and S. Kasif)

Gene identification and gene discovery in new genomic sequences is one of the
most timely computational questions addressed by bioinformatics scientists.
This computational research has resulted in several systems that have been
used successfully in many whole-genome analysis projects. As the number of
such systems grows, the need for a rigorous way to combine their predictions
becomes more essential. In this presentation we provide a Bayesian network
framework for combining gene predictions from multiple systems. The framework
allows us to treat the problem as combining the advice of multiple experts.
Previous work in the area used relatively simple ideas such as majority
voting. We describe the application of a family of combiners of increasing
statistical complexity. In particular, we introduce, for the first time, the
use of Hidden Input/Output Markov models for combining gene predictions. We
apply the framework to the analysis of the Adh region in Drosophila, which has
been carefully studied in the context of gene finding and used as a basis for
the GASP competition. Our preliminary results suggest that the probabilistic
network solution is promising, resulting in a significant improvement in
exon-level accuracy over the best single predictor. The main challenge in
combining gene prediction programs is the fact that the systems rely on
similar features, such as codon usage, and as a result the predictions are
often correlated. We show that our approach is promising for improving
prediction accuracy and provides a systematic and flexible framework for
incorporating multiple sources of evidence into gene prediction systems.
We also note that the approach we described is in principle applicable to
other predictive tasks, such as promoter or transcription element
recognition, and to combining different sources of functional genomics data.


Gene Finding in Eukaryotes

Mihaela Pertea, Johns Hopkins University and The Institute for Genomic Research

The gene finding research community has focused considerable effort on human
and bacterial genome analysis. This has left some small eukaryotes without a
system to address their needs. We focused our attention on this category of
organisms and designed several algorithms to improve the accuracy of gene
detection for them. We considered three alternatives for gene searching. The
first identifies a coding region by searching for signals surrounding the
coding region. This technique is used by GeneSplicer, a program that predicts
putative locations for splice sites. A second alternative is to identify a
protein-coding region by analyzing the nucleotide distribution within the
coding region. Complex gene finders like GlimmerM combine both of the above
alternatives to discover genes. The third alternative carefully combines the
predictions of existing gene finders to produce a significantly improved gene
detection system.

GeneSplicer is a new, flexible system for detecting splice sites in the
genomic DNA of various eukaryotes. The system has been tested successfully
using DNA from two reference organisms: the model plant Arabidopsis thaliana
and human. It was compared to six programs representing the leading splice
site detectors for each of these species: NetPlantGene, NetGene2, HSPL,
NNSPLICE, GENIO and SpliceView. In each case GeneSplicer performed comparably
to the best alternative, in terms of both accuracy and computational
efficiency.

The basis of GlimmerM is a dynamic programming algorithm that considers all
combinations of possible exons for inclusion in a gene model and chooses the
best of these combinations. The decision about which gene model is best is a
combination of the strength of the splice sites and the score of the exons
produced by an interpolated Markov model (IMM). The system, which is freely
available at http://www.tigr.org/softlab, has been trained for Plasmodium
falciparum, Arabidopsis thaliana and Oryza sativa (rice), and should also work
well on closely related organisms.

We developed a combiner algorithm that gains from the diversity of three or
more gene finders. The combiner was tested on three gene finders developed
specifically for the Arabidopsis genome: GENSCAN, GeneMark.HMM and GlimmerA
(the GlimmerM version for A. thaliana). These gene finders are the result of
years of development, and improving upon them is quite difficult. The combiner
algorithm not only succeeds at this, but also offers a real possibility of
further improvements if and when the underlying gene finders are improved.


Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements

Yitzhak Pilpel, Harvard University (joint work with P. Sudarsanam and G.M. Church)

The recent availability of microarray data has led to the development of
several computational approaches for studying genome-wide transcriptional
regulation. However, few studies have addressed the combinatorial nature of
transcription, a well-established phenomenon in eukaryotes. We have developed
a new computational method that analyzes microarray data to discover
synergistic motif combinations in the promoters of S. cerevisiae. Our method
suggests causal relationships between each motif in a combination and the
observed expression patterns. In addition to identifying novel motif
combinations that affect expression patterns during the cell cycle,
sporulation, and various stress response conditions, we have also discovered
regulatory cross-talk between several of these processes.

We have generated motif synergy maps that provide a global view of the
transcription networks in the cell. The maps are highly connected, suggesting
that a small number of transcription factors are responsible for a complex
set of expression patterns in diverse conditions. This approach should be
important for modeling transcriptional regulatory networks in more complex
eukaryotes.


Unweaving Regulatory Networks: Automated Extraction from Literature and Statistical Analysis

Andrey Rzhetsky, Columbia University

In the first part of the talk I will describe our ongoing effort to build a
natural language processing system that extracts information on interactions
between genes and proteins from research articles. In the second part I will
introduce an algorithm for predicting molecular networks from sequence data
and stochastic models of the birth of scale-free networks.


An Integrative Platform for Expression and Sequence Data

Sven Schuierer, Novartis (joint work with S. Bergling, I. Crignon, U. Dengler, S. Grzybek, J. Lange, J. Rahuel, M. Reinhardt and J. Zhue)

Experimental high-throughput techniques such as sequencing and protein and
microarray expression experiments are creating an enormous amount of data. The
rapidly changing environment of genomic research involves a wide range of
technologies, which leads to a large number of heterogeneous data sources and
access methods. System integration is essential in this environment. To meet
this challenge we have developed the integrative platform DEMON (Differential
Expression & Annotation Monitor), which combines expression and sequence data
in one database. The core features of DEMON are:
  • a generic interface to data and algorithms
  • a flexible, open architecture
  • a scheduling mechanism for automated high-throughput data processing, e.g. for the analysis of sequence and expression data.

DEMON provides tools for the analysis of microarray expression data which are
linked to sequence annotation and sequence classification information. The
sequence annotation is pre-computed, which gives users immediate access to the
results of a number of different sequence similarity searches. Furthermore,
DEMON allows different sequence types to be queried in a uniform manner by
building a common coordinate system to which related DNA, RNA and protein
sequences are mapped.


NSF Funding Availability for Data Integration Efforts

Sylvia Spengler, National Science Foundation

The National Science Foundation has a variety of opportunities for individuals
seeking support for data integration activities. These include a variety of
cross-directorate activities as well as programs in BIO and CISE. I will give
an overview of the opportunities and discuss ways to make future calls rapidly
available.


What Can Be "Learned" from Gene Expression Arrays?

Gustavo Stolovitzky, IBM

One important application of gene expression arrays is functional annotation.
When cells are treated under different conditions, genes will change their
profile of expression according to their cellular role, and in principle this
profile can be learned using machine learning techniques. When these
algorithms are used along with prior knowledge of gene function, one might
expect to learn the expression signatures of different functional classes.
Support Vector Machines and other machine learning algorithms have been
applied for this purpose [Brown et al., PNAS 97, 262-267 (2000)], and this
work will be reviewed. We have explored the use of a supervised learning
scheme that uses artificial neural networks (NNs) for the purpose of
functional annotation.
We considered 100 functional classes catalogued in the MIPS (Munich
Information Center for Protein Sequences) database and attempted to learn
their signatures, using the gene expression data previously used by Eisen
[PNAS 95, 14863-14868 (1998)]. We found that only a small subset (less than
10%) of these functional classes can be learned. We explored the features
that make a class "learnable". For one of the best-learned classes, the TCA
cycle, we did a systematic analysis of the false positives and false negatives
arising from a cross-validation scheme, and found that they can be accounted
for in terms of metabolic pathways related to the TCA cycle using the KEGG
database of biochemical pathways.


A Heart Failure Knowledgebase Combining Experimental Data with Tools for Integrative Biological Modeling

Raimond L. Winslow, The Whitaker Biomedical Engineering Institute Center for Computational Medicine and Biology, and Department of Computer Science (joint work with W. Baumgartner Jr., P. Helm, D. Scollan, C. Yung and T. Suzek)

Heart failure, the most common cardiovascular disorder, is characterized by
ventricular dilatation, decreased myocardial contractility and cardiac output.
Prevalence in the general population is over 4.5 million, and increases with
age to levels as high as 10%. New cases number approximately 400,000 per year.
Patient prognosis is poor, with mortality roughly 15% at one year, increasing
to 80% at six years subsequent to diagnosis. It is now the leading cause of
sudden cardiac death in the U.S., accounting for nearly half of all such
deaths. For the past six years, we have worked with experimental colleagues in
the NIH-funded Specialized Center of Research in Sudden Cardiac Death to
achieve a more comprehensive understanding of the origins and treatment of
heart failure.

We have done so by undertaking a range of experimental studies which include:
a) large-scale measurement of mRNA levels in cardiac tissue; b) patch-clamp
and whole-cell recording in individual isolated myocytes; c) imaging of
cardiac micro-anatomical structure; and d) electrophysiological recordings of
electrical activity; in both normal and failing hearts. At the same time, we
have formulated computational models, ranging from the level of individual ion
channels to single cells and whole heart, and have used these models to
investigate the relationship between altered patterns of gene expression and
mechanisms of arrhythmia in heart failure. In this talk, we will describe how
computational models have been applied to reach specific conclusions regarding
the mechanisms by which life-threatening arrhythmias arise in heart failure.
We will also describe a more general conclusion emerging from this work: that
heart failure is a complex disease characterized by changes in expression of
hundreds of genes involved in many different cellular sub-systems. In our
view, understanding the functional significance of these changes is a
challenging problem that will require the development of a heart failure
knowledgebase comprised of: a) regional gene expression data; b) regional
cellular electrophysiological data; c) cardiac micro-anatomical data; and d)
whole-heart electrical mapping data; obtained from normal versus failing
hearts. In addition, an interface that supports the exploration of these
diverse data sources for the purpose of model development is required. We will
describe our initial efforts at creating key components of this heart failure
knowledgebase. (Supported by NIH HL60133, the NIH Specialized Center of
Research on Sudden Cardiac Death P50 HL52307, the Whitaker Foundation, and IBM
Corporation.)


    Document last modified on June 11, 2001.