This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).
Title: Modeling the dynamics and function of cellular interaction networks
Interaction between gene products forms the basis of essential processes like signal transduction, cell metabolism or embryonic development. Recent experimental advances helped uncover the structure of many cellular networks, creating a surge of interest in the dynamical description of gene regulation. Traditionally genetic and protein interactions are modeled by differential equations based on reaction kinetics, but these studies are greatly hampered by the sparsity of known kinetic detail. As an alternative, qualitative models assuming a small set of discrete states for gene products, or combinations of discrete and continuous dynamics, are gaining acceptance. Many results also suggest that the interaction topology plays a determining role in the dynamics of regulatory networks and there is significant robustness to changes in kinetic parameters. In this presentation I will explore models of the gene regulatory network governing the segmentation of fruit fly embryos, and of the signal transduction network regulating drought response in plants. Each model is able to give predictions and insights into its respective biological process, and illuminates the emergent (network-level) functional robustness of cellular regulatory networks.
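As a rough illustration of the qualitative, discrete-state modeling described above, the following minimal sketch builds a synchronous Boolean network and enumerates its steady states. The three nodes (A, B, C) and their update rules are hypothetical and chosen only for illustration; they are not taken from the segmentation or drought-response models discussed in the talk.

```python
from itertools import product

# Hypothetical three-node network; node names and update rules are
# illustrative assumptions, not part of any published model.
RULES = {
    "A": lambda s: s["A"] and not s["C"],  # A sustains itself unless C represses it
    "B": lambda s: s["A"],                 # B is activated by A
    "C": lambda s: s["C"] and not s["A"],  # C sustains itself unless A represses it
}


def step(state):
    """Synchronously update every node from the current state."""
    return {node: bool(rule(state)) for node, rule in RULES.items()}


def fixed_points():
    """Enumerate all 2^n states and keep those unchanged by one update."""
    nodes = sorted(RULES)
    result = []
    for values in product([False, True], repeat=len(nodes)):
        state = dict(zip(nodes, values))
        if step(state) == state:
            result.append(state)
    return result


if __name__ == "__main__":
    for fp in fixed_points():
        print(fp)
```

Only the wiring (who activates or represses whom) enters such a model, with no kinetic parameters, which is one way to see why topology alone can constrain the possible steady states.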
Title: Ab initio predictions of transcription factor binding
Gene regulation depends on transcription factor proteins that recognize specific DNA sequences. We are developing computational methods to calculate the recognized DNA sequence purely from the transcription factor protein sequence and homology modeling of the protein-DNA complex. Predictions are ab initio in that they do not rely on analysis of promoters of presumably co-regulated genes. The long-term goal of this project is to predict the regulatory protein-DNA interactions of an organism purely from its genome sequence. We will describe initial results with homeodomain and leucine zipper protein families.
Title: Protein Subfamily Classification Methods for NCBI's Conserved Domain Database
NCBI's protein classification project aims to provide a comprehensive set of multiple alignments corresponding to ancient domains with conserved functions. The project exploits the recent availability of information from extensive whole-genome sequencing and 3D protein structure determination from structural genomics projects. Protein 3D structures are used to identify the conserved "core" substructures of protein domain families and to specifically identify sets of homologous sites to be used in phylogenetic tree calculations. Sequences from diverse organisms with sequenced genomes are systematically included in the alignments so that the evolutionary "age" of conserved groups may be estimated by the presence of "sentinel" taxonomic groups. The project's overall goal is to provide structure-based alignments and associated search reagents for all domain orthology groups originating by gene duplications approximately 500 million or more years in the past.
This Conserved Domain Database (CDD) classification project also differs from related efforts in its organizational approach. As in other projects, expert biologist-curators abstract functional annotation from the scientific literature, maintaining citations to relevant articles. In addition, however, CDD curators construct structure-based multiple alignments and annotate interaction sites observed in structural complexes using a software tool, "Cn3D", developed specifically for this purpose by the CDD bioinformatics team. CDD curators furthermore identify and annotate ancient subfamilies using a phylogenetic analysis tool, "CDTree", developed specifically for the project. CDD alignment hierarchies consist of parent and child subfamilies, nested to any depth, where strict consistency of aligned residues is checked and maintained by software throughout the curation process. This organization of alignment information allows curators to incorporate new sequences and 3D structures without loss of existing functional-site and subfamily annotation, while also allowing modification of alignments and subfamilies based on the new information.
To date, CDD curators have completed structure-based alignment and ancient orthology-group classification for approximately 3000 families and subfamilies. While the project goal of a comprehensive classification remains for future work, some conclusions can be drawn from progress to date. One finding is that use of 3D structure and structural similarity invariably introduces significant changes in alignments, as compared to multiple alignments based on sequence information alone. Another finding is the very large number of ancient subfamilies identified. This number is far greater than the number of families identified by aggressive sequence-similarity clustering, and to a first approximation CDD can be characterized as the "splitting" of many ancient protein domain families into subfamilies with distinct functions. CDD search services are freely available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml and are currently accessed by approximately 20,000 users daily.
Title: Towards the prediction of residues involved in the folding nucleus of proteins
Representing a protein on a non-cubic lattice, coupled with a Monte Carlo dynamic algorithm, yields essential topological information about the folding process. Although precise information on the native structure cannot be obtained within the framework of this model, it gives useful information on the residues that have a strong tendency to be buried from the first stages of folding. Specifically, some residues have a mean number of neighbors higher than the mean value along the polypeptide chain. They are referred to as Most Interacting Residues (MIR) and they mainly correspond to residues forming the folding nucleus. They are statistically correlated with structural characteristics such as the ends of closed loops, as well as with sequence markers such as fingerprints and amino acid triplets. Combining the results of these different approaches should lead to a finer prediction of the folding nucleus. Application of the MIR approach to fragments and point mutations related to amyloid fibril formation shows a strong correspondence between MIR positions and the critical regions.
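The selection criterion stated above (a mean number of neighbors higher than the mean value along the chain) can be sketched as follows. This is only an illustration of that one sentence: the function name, the optional `margin` parameter, and the input values are assumptions, and the lattice simulation that would produce the per-residue neighbor counts is not shown.

```python
def most_interacting_residues(mean_neighbors, margin=0.0):
    """Return indices of residues whose mean neighbor count exceeds the chain mean.

    mean_neighbors: per-residue mean number of neighbors, assumed to come
                    from averaging over many lattice folding trajectories.
    margin:         hypothetical extra offset above the chain mean
                    (0.0 reproduces the plain "above the mean" criterion).
    """
    if not mean_neighbors:
        return []
    chain_mean = sum(mean_neighbors) / len(mean_neighbors)
    return [i for i, n in enumerate(mean_neighbors) if n > chain_mean + margin]


if __name__ == "__main__":
    # Made-up neighbor counts for a four-residue toy chain.
    print(most_interacting_residues([2.0, 5.0, 3.0, 6.0]))
```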
Title: Deconstructing Multi-domain Homology
Multi-domain sequences represent both an intriguing evolutionary process and an impediment to automated genomic analyses. Duplication and rearrangement of functional subunits, called domains, is Nature's equivalent of rapid prototyping. Genomic evidence suggests that the functional possibilities resulting from the combinatorial explosion of possible domain arrangements offer special evolutionary opportunities: multi-domain families played a key role in the evolution of multicellularity and expanded dramatically in vertebrates, where they are prevalent in tissue repair, cell death and the immune system. However, the modularity that gives multi-domain sequences their functional versatility also challenges traditional sequence analysis methods. I will discuss how the structure of the sequence similarity network can be used to ask questions about homology and evolution in multi-domain families.
Title: Evolution of Protein Domain Architectures
Our goal is to develop the first coherent, large-scale model to explain when and how extant protein domain architectures arose and how they changed over time. We establish how architectures undergo fissions, fusions, or other recombination events to produce new ones and show that fusion of existing domains is the primary driver in developing new architectures. Our model characterizes over 85% of known domain architectures and illustrates the evolution of architectures in over 150 of the most well-studied bacterial, archaeal, and eukaryotic species. We can use this model to better determine phylogenetic relationships between species and to further explore how the function and structure of related proteins vary across organisms.
Title: Understanding Protein Function on a Genome-scale using Networks
My talk will be concerned with topics in proteomics, in particular predicting protein function on a genomic scale. We approach this through the prediction and analysis of biological networks -- both of protein-protein interactions and transcription-factor-target relationships. I will describe how these networks can be determined through Bayesian integration of many genomic features and how they can be analyzed in terms of various simple topological statistics.
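One common way to integrate several genomic features in a Bayesian fashion is a naive Bayes combination, in which the prior odds of an interaction are multiplied by one likelihood ratio per feature. The sketch below assumes conditional independence of the features; the abstract does not state which formulation is used, and the function name and numbers are illustrative only.

```python
import math


def posterior_interaction_probability(prior, likelihood_ratios):
    """Combine evidence as: posterior odds = prior odds * product of likelihood ratios.

    prior:             prior probability that a protein pair interacts (0 < prior < 1).
    likelihood_ratios: one value per genomic feature, each equal to
                       P(feature | interacting) / P(feature | not interacting);
                       the features are assumed conditionally independent.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for lr in likelihood_ratios:
        if lr <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(lr)  # summing logs avoids overflow in long products
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Made-up example: a low prior and two supportive features.
    print(posterior_interaction_probability(0.2, [2.0, 2.0]))
```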
Joint Work with S Balasubramanian, O Emanuelsson, C Goh, J Karro, L Lu, N Luscombe, D Milburn, J Rozowsky, Y Xia, Z Zhang, P Bertone, H Yu, Y Kluger, Y Liu, J Qian, P Harrison, R Jansen, N Echols, MB&B Dept. Yale University
Title: Remote Homology Inference: What are the limits?
Approaches integrating sequence, structure and functional information with evolutionary considerations have proven most effective for understanding weak similarities between proteins. Several examples of remote homology inference using a combination of computational methods will be discussed. In particular, the power of transitive sequence similarity searches for reliable detection of homologs at levels close to and below random sequence identity will be illustrated. Pairs of proteins with statistically supported sequence similarity that adopt different structural folds will be shown.
Title: Unifying measures of gene function and evolution
Recent genome analyses revealed many intriguing correlations between variables characterizing the functioning of a gene, such as expression level, connectivity of genetic and protein-protein interaction networks, and knockout effect, and variables describing gene evolution, such as sequence evolution rate and propensity for gene loss. Typically, variables within each of these classes are positively correlated, e.g., products of highly expressed genes also have a propensity to be involved in many protein-protein interactions, whereas variables between classes are negatively correlated, e.g., highly expressed genes, on average, evolve more slowly than weakly expressed genes. I describe principal component (PC) analysis of seven genome-related variables and propose biological interpretations for the first three principal components. The first PC reflects a gene's "importance", or the "status" of a gene in the genomic community, with positive contributions from knockout lethality, expression level, number of protein-protein interaction partners and the number of paralogs, and negative contributions from sequence evolution rate and gene loss propensity. The next two PCs define a plane that seems to reflect the functional and evolutionary plasticity of a gene. Specifically, PC2 can be interpreted as a gene's "adaptability" whereby genes with high adaptability readily duplicate, have many genetic interaction partners and tend to be non-essential. PC3 also might reflect the role of a gene in organismal adaptation albeit with a negative rather than positive contribution of genetic interactions; we provisionally designate this PC "reactivity". The interpretation of PC2 and PC3 as measures of a gene's plasticity is compatible with the observation that genes with high values of these PCs tend to be expressed in a condition- or tissue-specific manner.
Functional classes of genes substantially vary in status, adaptability, and reactivity, with the highest status characteristic of the translation system and cytoskeletal proteins, highest adaptability seen in cellular processes and signaling genes, and top reactivity characteristic of metabolic enzymes.
Title: Analysis of schematic representations of protein folding patterns
In order to study the distribution of topological features across known protein folding patterns, we have analyzed a representation designed to abstract the essential properties of protein folding patterns, in a form intelligible to both humans and computers. We have created a database and mined it to determine: (1) statistical properties of contacts between secondary structural elements in an unbiased set of protein domains of known structure, from the ASTRAL compilation, and (2) the use of the representation in fold identification, with interesting implications about the extent to which the local structure of proteins determines the full topology. The results reported are based on work with Mr A Kamat.
Title: Known function to new proteins and new function to known proteins: The case of toxin-like proteins
The power of assigning known function to new proteins has been tested at the genome level. The recently sequenced honey bee genome has produced ~10,000 predicted protein sequences. We have applied an unsupervised protein clustering method to the honey bee genome jointly with other selected complete proteomes. The 6000 families that were created provided a rich evolutionary trace of gene loss and gene gain in the bee. Applying an annotation inference method, we were able to provide functional annotation for 76% of the sequences. We then tested the ability to assign new high-level functions that are unclassifiable by sequence-based methods. We used the toxin-like proteins as a test case. Toxins are proteins that are extremely varied in their structure and biochemical function. We constructed a toxin classifier, applied it to a non-redundant set of ~30000 proteins, and detected instances of known toxins and toxin-like proteins as well as new functions in genomes. We will illustrate the discovery of new toxin-like proteins in insect and mammalian brain. Our work can be applied to other high-level functional classifications.
Title: Where in the structure is protein function encoded?
A traditional view is that a protein needs to be stable and to have an intact active site to function. Amino acids of the hydrophobic core and of the active site are expected to be conserved in evolution to preserve the protein's function; others, especially surface loops, are more variable and believed to be less important.
We argue that tunable and specific function of a protein requires more than stability and a conserved active/binding site. By examining the recognition mechanism of DNA-binding proteins, we demonstrate that flexibility of the protein is essential for protein function. We show that coupling between folding and binding allows a protein to find its target fast and bind it tightly. Using comparative genomics, we identify functionally important residues that are not conserved among homologs. Our results expand the traditional paradigm of function and conservation by showing that (i) protein function can be spread over large regions of protein structure, requiring their low stability; and (ii) certain functional sites can appear as non-conserved in evolution.
Title: Assigning protein function from sequence and from structure
In this talk, I will describe three quite different techniques that we have developed recently for analyzing proteins.
The first is an extension of support vector machine (SVM)-based methods for protein functional annotation to allow for heterogeneous data. These data may include multiple representations of the protein sequence, plus other data sets such as gene expression, protein localization, and protein-protein interactions. The method uses semidefinite programming techniques to find a globally optimal SVM solution in the presence of these diverse types of data.
The second method is a multi-class classification method based on output codes. Our method learns relative weights between one-vs-all classifiers and, in the context of protein classification, encodes information about the class hierarchy for multi-class prediction. This code weighting approach significantly improves on the standard one-vs-all method for the fold recognition problem.
The third technique predicts Gene Ontology terms directly from protein structures. In this work, we compare a variety of structure-based SVM classifiers, and show that a simple representation based upon global structural alignment of the protein backbone provides superior classification performance, compared to sequence-based approaches and compared to previously described structure-based methods.
Title: Exploiting Structural and Comparative Genomics to Reveal Protein Functions
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries.
The CATH domain structure database now contains ~70,000 domain structures, which can be mapped to between 40% and 60% of the domain sequences in completed genomes, depending on the organism. The Gene3D resource combines information on CATH structural domains with functional domains from Pfam and newly identified families (NewFams), to provide information on domain compositions for sequences in completed genomes. These data enable more reliable inheritance of function across domain families and can also be used to chart the evolution of functions in these families using phylogenetic approaches. Further, by exploiting the structural data we can understand the mechanisms by which novel functions evolve in families. Correlations between phylogenetic occurrence profiles for domain structure families, derived at different levels in the CATH classification hierarchy, can be used to identify families that are likely to be functionally associated in some way, such as through protein-protein interactions or functional complexes.
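To make the idea of correlating phylogenetic occurrence profiles concrete, the sketch below computes a Pearson correlation between two presence/absence vectors, one entry per genome. The choice of Pearson correlation, the function name, and the example profiles are assumptions for illustration; the abstract does not specify the measure actually used.

```python
import math


def profile_correlation(profile_a, profile_b):
    """Pearson correlation between two occurrence profiles.

    Each profile is a sequence with one entry per genome, e.g. 1 if the
    domain family occurs in that genome and 0 otherwise (counts also work).
    Returns 0.0 when either profile is constant, since the correlation is
    undefined in that case.
    """
    n = len(profile_a)
    if n == 0 or n != len(profile_b):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Made-up profiles over four genomes.
    family_x = [1, 0, 1, 0]
    family_y = [1, 0, 1, 0]
    print(profile_correlation(family_x, family_y))
```

Families whose profiles correlate strongly across many genomes would then be candidates for functional association, subject to whatever significance testing the analysis requires.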
Joint work with Juan Garcia-Ranea, Corin Yeats, Sarah Addou, David Lee and Christine Orengo
Title: Protein cores and loops: important clues to protein structure and function evolution
Two proteins are considered to have a similar structural core if sufficiently many of their secondary structure elements are positioned similarly in space and are connected in the same order. On the other hand, the intervening regions (loops) between the superposable helices and strands can exhibit a wide range of similarity, and their analysis may offer clues to protein classification and to the evolution of structure and function. Loop-based structural similarity measures have been shown to improve structural classification of homologous proteins, and loop-based distance matrices often produce more reliable protein family classifications compared to the conventional measures of sequence and structural similarity. At the same time, analysis of loop length distribution in relation to protein function and taxonomic diversity allows important conclusions to be drawn about protein size evolution and the history of indel events. Finally, I will address the question of how protein structure changes in its conserved aligned core regions and unaligned loop regions as proteins diverge from a common ancestor.
Title: Towards uncovering dynamics of protein interaction networks
A major challenge in systems biology is to understand the intricate network of interacting molecules. The complexity in biological systems arises not only from various individual protein molecules but also from their organization into protein complexes and functional modules. Recently, several groups developed computational methods to identify network modules and/or protein complexes in protein-protein interaction networks. In this talk I will describe our contribution towards the next step in analysis of protein-protein interaction networks - recovering temporal relations and overlaps between functional groups.
We develop a graph-theoretical framework which allows elucidating functional groups together with temporal relations between them. We apply our approach to delineate the pheromone signaling pathway from a high-throughput protein-protein interaction network.
Title: Evolution teaches protein function prediction
Our long-term goal is to contribute to the development of a comprehensive system that models three aspects of function, particularly in multi-cellular organisms: where a molecule is located most of the time, what it interacts with, and when the interaction occurs. In my talk, I will focus on three particular topics pertaining to the first of these three objectives. Firstly, I will describe one of our methods for the de novo prediction of subcellular localization. Secondly, I will sketch our first steps toward de novo predictions of protein-protein and protein-DNA interactions. Lastly, I will briefly sketch the idea behind our most recent, preliminary improved method for database searches employed for the selection of targets in the context of large-scale structural genomics. One goal of structural genomics is to determine one experimental protein structure for each representative protein family. Our experimental colleagues have determined experimental structures for over 200 of the proteins that we selected. In this context, I shall present some of our recent findings about the organization of sequence/structure space.
Title: Analyzing and interrogating protein interaction maps using network schemas
High-throughput technologies have led to proteome-scale interaction maps for several organisms. Computational analysis of these networks can reveal important principles of cellular organization and help uncover protein function. Towards such analysis, we introduce pathway schemas to specify recurring means by which biological processes are carried out. In this talk, I will discuss our approach to uncover and query network schemas within interaction graphs. We show that network schemas can be used to uncover functionally cohesive subnetworks of interest.
Title: A structure-centric view on adaptation, evolution of protein structure and function
In this presentation I will first review recent discoveries of global patterns in the protein universe presented as a network of similarities (the Protein Domain Universe Graph, or PDUG): scale-free behavior, clustering, and phenomenological models that explain them in terms of divergent evolution. The clusters of the PDUG carry strong functional signals (functional fingerprints of folds) that result from divergent coevolution of structure and function. Further, I will present a novel microscopic model of organismal evolution with an explicit relation between genotype (the stability of proteins, which can be evaluated exactly for the model) and phenotype (the death rate of an organism). I will show how organismal evolution gives rise to a protein universe that possesses all the highly non-trivial properties of the real protein universe, such as scale-free organization, high-order correlations and peculiar clustering of the nodes.
Title: Structural phylogenomic inference of protein function
Phylogenomic inference of protein function has been demonstrated to improve the accuracy of functional annotation. My lab, the Berkeley Phylogenomics Group, combines phylogenetic tree construction with protein structure prediction and analysis to elucidate how changes in structure are correlated with changes in protein function. In this talk, I will present several methods developed in my lab for structural phylogenomic analysis, including (1) clustering proteins into global homology groups as the basis of phylogenomic inference (FlowerPower), (2) simultaneous sequence alignment and tree construction using hidden Markov models (SATCHMO), (3) an information theoretic method for identification of functional subfamilies (SCI-PHY), (4) subfamily hidden Markov models for classification of novel sequences to functional subfamilies, and (5) new methods for identifying subfamily-specificity positions and functional epitopes in proteins using information from phylogenetic analysis, 3D structure and multiple sequence alignment. Finally, I will present the Universal Proteome Explorer, a phylogenomic library with over 7,000 "books" for protein families and structural domains for automated phylogenomic classification at the genome level.
Title: Quality and effectiveness of protein structure models
Computational models of protein structures can be extremely useful in biomedical sciences. I will discuss two specific applications of modelling techniques: molecular replacement in X-ray crystallography and function prediction.
As far as molecular replacement is concerned, I will describe a procedure for quickly assessing whether a protein structure can be solved by molecular replacement.
I will also report the results of an analysis of the functional predictions submitted to the worldwide CASP experiment. We revisited the results of the experiment and demonstrated that predictions deriving from a consensus of different methods can reach an accuracy as high as 80%. It follows that some of the predictions submitted to CASP6, once re-analysed taking into account the type of converging methods, can provide very useful information to experimentalists interested in the function of the target proteins.
Title: Predicting protein function and interactions on a whole-genome scale
Understanding gene function and protein behavior within biological networks is a key challenge in modern systems biology. Broad availability of diverse functional genomic data should enable fast and accurate generation of network models through computational prediction and experimental validation. I will discuss our recent work on prediction of protein function from diverse genomic data with the aid of the Gene Ontology structure. I will also introduce a probabilistic system we developed for discovery and validation of biological process-specific networks based on diverse heterogeneous data. Using these approaches, we have modeled multiple known processes in Saccharomyces cerevisiae, characterized unknown components in these processes through computational predictions and experimental validation, and identified novel cross-talk relationships between biological processes.