DIMACS Workshop on Information Processing by Protein Structures in Molecular Recognition

Dates: June 13 - 14, 2005
DIMACS Center, CoRE Building, Rutgers University

Organizers:: Bhaskar DasGupta, University of Illinois at Chicago, dasgupta@cs.uic.edu; Jie Liang, University of Illinois at Chicago, jliang@uic.edu

Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology.

This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).

Abstracts:

Brian Chen, Rice University

Title: Algorithms for Structural Comparison and Statistical Analysis of Three Dimensional Protein Motifs

The comparison of structural subsites in proteins is increasingly relevant to the prediction of their biological function. To address this problem, we have designed the Match Augmentation algorithm (MA). Given a structural motif of interest, such as a functional site, MA searches a target protein structure for a match: the set of atoms with the greatest geometric and chemical similarity. MA is extremely efficient because it exploits the fact that the amino acids in a structural motif are not equally important to function. Using motif residues ranked by functional significance via the Evolutionary Trace (ET), MA prioritizes its search by initially forming matches with functionally significant residues, then, guided by ET, it augments this partial match stepwise until the whole motif is found. With this hierarchical strategy, MA runs considerably faster than other methods, and almost always identifies matches in homologs known to have cognate functional sites. In order to interpret matches, we further developed a statistical method using nonparametric density estimation of the frequency distribution of structural matches. This data driven technique measures the statistical significance of matches in an effort to separate matches which identify functional homologs from functionally unrelated proteins. Preliminary results show that matches identifying cognate active sites are statistically significant.

This is joint work with V.Y. Fofanov, D.M. Kristensen, M. Kimmel, L. Kavraki and O. Lichtarge

Vicky Choi, Virginia Tech

Title: On Updating Torsion Angles of Molecular Conformations

Conformation of a molecule is defined by the relative positions of atoms in the molecule. There are three main representations for conformations of molecules: Cartesian coordinates, distance matrix and internal coordinates. In biochemistry, conformational changes of a molecule are usually described in terms of its internal coordinates. However, for many applications, such as molecular docking, the Cartesian coordinates of atoms are needed for computation. Although for each conformational change, the Cartesian coordinates of atoms can be updated in linear time (which is optimal asymptotically), the constant factor becomes significant if a large number of updates are needed. Zhang and Kavraki examined three methods, called simple rotations, Denavit-Hartenberg local frames and atomgroup local frames. Based on their implementations, they showed that atomgroup local frames are more efficient than the other two. In this paper, by expressing the torsion angle change as a composition of translations and rotations, we observe that the simple rotations can be implemented in an efficient way by taking advantage of consecutive operations. Both quantitative and experimental comparisons show that the improved simple rotations, in which the rotation operation is expressed either in unit quaternion or rotation matrix, is as efficient as atomgroup local frames and thus have the advantage of avoiding the need of precomputations of a set of local frames and transformations between them in the local frames approach. Our source code will be made available publicly.

Claudio Cifarelli, University of Rome "La Sapienza"

Title: On global optimization of the Lennard-Jones potential functions

The minimization of molecular potential energy functions is one of the most challenging nonconvex global optimization problems and plays an important role in the determination of stable states of certain classes of molecular clusters and proteins. In this presentation we discuss several issues related to this problem including optimality conditions, equivalent formulations, and possible bounds on the global minimum of the total potential energy of the Lennard-Jones cluster. Some numerical results are also presented.

References

P.M. Pardalos, D. Shalloway, and G. Xue, "Optimization Methods for Computing Global Minima of Nonconvex Potential Energy Functions" Journal of Global Optimization Vol. 4, No. 2 (1994), pp. 117-133.
P.M. Pardalos, D. Shalloway and G. Xue (Eds), "Global Minimization of Nonconvex Energy Functions: Molecular Conformation and Protein Folding", DIMACS Series Vol. 23, American Mathematical Society, (1996).
C. Floudas, J. Klepeis, and P.M. Pardalos, "Global Optimization Approaches in Protein Folding and Peptide Docking", In "Mathematical Support for Molecular Biology" (Edited by M. Farach-Colton, F.S. Roberts, M. Vingron, and M. Waterman), DIMACS Series, Vol. 47, American Mathematical Society (1999), pp. 141-171.
C. Floudas and P.M. Pardalos (Eds), "Optimization in Computational Chemistry and Molecular Biology", Kluwer Academic Publishers, (2000).
H. X. Huang, P.M. Pardalos, and Z.J. Shen, "Equivalent formulations and necessary optimality conditions for the Lenard-Jones problem", Journal of Global Optimization Vol. 22, No. 1-4 (2002), pp. 97-118.

Bhaskar DasGupta, UIC

Title: Reverse Engineering of Protein and Gene Networks: how to design experiments efficiently?

In this talk we investigate the computational complexities of a combinatorial problem that arises in the reverse engineering of protein and gene networks. We abstract a combinatorial version of the problem and observe that this is ``equivalent'' to the set multicover problem. We discuss efficient randomized approximation algorithms for this problem and discuss why achieving signifacntly better approximations than ours may be computationally difficult.

Joint work with Piotr Berman (Pennsylvania State University) and Eduardo Sontag (Rutgers University).

Drena Dobbs, Iowa State University

Title: Computational identification of binding sites in proteins

Virtually all cellular processes depend on precisely orchestrated interactions between proteins and other molecules within cells. We are developing computational approaches for predicting which amino acids of a protein participate in protein-protein, protein-DNA and protein-RNA interactions. We have generated several non-redundant datasets of interfaces extracted from macromolecular complex structures in the PDB and used a variety of knowledge-based approaches to generate classifiers capable of predicting which residues of the protein partner(s) are located in the complex interface. In published work, we have shown that machine learning algorithms can be successfully used to identify binding sites in protein-protein complexes using a combination of sequence, structural, and phylogenetic information. More recently, we have used similar approaches to analyze protein-nucleic acid complexes and to generate classifiers capable of accurately predicting which amino acids of the protein partner participate in nucleic acid recognition, using only the amino acid sequence of the protein as input. The level of success obtained in our experiments suggests that these computational approaches will be valuable both for localizing binding domains in structural genomics targets for which structural, but not functional information is available and for functional classification of proteins when structural information is not available. The ability to reliably predict which residues of a protein directly contribute to ligand binding, using a fully automated approach, should advance our understanding of how proteins recognize their interactions partners in cells and potentially contribute to the development of novel therapies for both genetic and infectious diseases.

Joe Dundas, UIC

Title: Sequence order independent structural alignment

Protein structure alignment is an indispensable tool used for many different studies in bioinformatics. Most structural alignment algorithms assume that the structural units of two similar proteins will align sequentially. This assumption is flawed, and as a result proteins with similar structure but with a permutated sequence arrangement are often missed. We present a solution to the problem based on an approximation algorithm that finds a sequence-order independent structural alignment that is close to optimal. We first exhaustively fragment two proteins and calculate a novel similarity score between all possible aligned fragment pairs. We treat each aligned fragment pair as a node on a graph that is connected by an edge, if there are intra-residue sequence conflicts. By regarding the realignment of the fragment pairs as a special case of the maximum-weight independent set problem, we solve this computationally hard problem approximately by iteratively solving relaxations of an appropriate integer programming formulation. The resulting structural alignment is sequence order independent. Our method is insensitive to gaps, insertions, deletions and circular permutations. As an example, we show that our algorithm is capable of automatically detecting proteins that share similar structure but differ by circular permutation. The use of circular permutation studies in aspects of research from protein evolution to protein folding is growing rapidly in spite of the low number of known occurrences of circular permutations.

Jieun Jeong, Pennsylvania State University

Title: Consistent sets of secondary structures in proteins

Several methods exist to predict secondary structures of a protein from its sequence of aminoacids. The methods of our interest make their predictions based on local patterns and they seem to have inherent limitation on their sensitivity/specificity relationship. We propose to apply such methods set to have high sensitivity and low specificity, thus predicting ``too many'' structures and then to solve a global optimization problem to extract a consistent set from the ``redundant'' family of predicted structures. In this talk we formulate several possible optimization problems and heuristics to solve them. Ming-Yang Kao, Northwestern University

Title: Reconstructing a Circular Order from Inaccurate Adjacency Information with Applications in NMR Data Interpretation

Given the adjacency information of $n$ elements in a circular order, our goal is to reconstruct the order. If there are errors in the adjacency data, then the problem is NP-hard in the general case. However, if the adjacency information is derived from each element reporting some numerical values as identifiers for itself and its front neighbor, then an optimal ordering can be defined as the one which minimizes the sum of differences between the numerical identifier of an element from that reported by its neighbor. Our main result is that finding such an optimal ordering takes optimal $\Theta(n\log n)$ time. This setting arises naturally in NMR data interpretation for determining protein structures.

Jie Liang, UIC

Title: Characterizations of protein functions through evolutionary models of structural binding surfaces

Inferring biological roles of proteins and classifying them by their functions are challenging tasks, as global protein sequence and structure similarities are often unreliable for functional inference. Protein plays its role by interacting with other molecules, and local binding surfaces contain direct useful information. To identify locally similar binding surfaces and to assess their biological similarity, scoring matrix such as Pam and Blosum are not suitable, because residues on protein functional surfaces experience different selection pressure than residues in folding core. We develop methods for estimating replacement rates of residues based on a continuous time Markov model using Bayesian Markov chain Monte Carlo. Combined with geometrically computed libraries of millions of binding surfaces, we show our method can infer protein functions from structures with accuracy, and can predict functional roles of proteins from structural genomics with unknown biological roles and with only hypothetical sequence homologs. We further discuss how to construct libraries of binding surfaces for proteins of different functional classes based on a canonical basis set of binding surfaces, and show such an approach can provide general models for classifying proteins by their biochemical roles with sensitivity and specificity.

Jun Liu, Harvard University

Title: Novel Sequential Monte Carlo Techniques and Their Applications to Protein Structure Analysis

I will describe the general sequential Monte Carlo (SMC) methodology, which has become popular in the engineering and statistics communities, and discribe its connections with molecular structure simulations and optimizations. Two new ideas for improving the efficiency of SMC methods will be introduced: using pilot samples to guide for resampling and incorporating a data-dependent optimal resampling criterion. We will illustrate these ideas using examples from Hydrophobic-Hydrophilic (HP) protein model simulation and near-native structure sampling. By applying the new SMC scheme, we were able to characterize many important ensemble properties of NNS, including the size of NNS set, the probability of randomly sampling one NNS structure, and the occurrence of native contacts in NNS. We also found that the widely used pairwise potential function behaved surprisingly bad for stabilizing near native protein structures. Based on the joint work with Junni Zhang, Jinfeng Zhang, Ming Lin, Rong Chen, and Jie Liang. Johannes Rudolph, Duke University

Title: Protein-Protein Interfaces

Protein-protein interactions form the basis for most cellular processes including events such as cell division and growth that are intimately linked to human diseases. Despite their importance and many previous studies, our understanding of the molecular basis for protein-protein interactions remains poor, lagging far behind that of protein folding and structure. We are striving to fill this gap by elucidating universal descriptors of protein interfaces based on composition, shape, and forces. We begin by defining and generating the interface surface between two or more complexed proteins using tools from computational geometry including weighted Voronoi diagrams, weighted Delauney triangulations, and topological persistence. Simply described, the interface surface is a wrinkly (non-flat) sheet halfway between two proteins trimmed where it protrudes beyond the interface. The advantage of the single interface surface over separate surfaces of the contributing proteins is that it provides a well-defined object that can be readily manipulated in calculations and be easily visualized. Analogous to protein structures, we define a hierarchy of structures on the interface surface. Primary structure describes the adjacencies (intra- and inter-molecular) of the residues contributing to the interface. Secondary structures are recurring geometric motifs such as cups and planes. We believe that most interface surfaces consist of a few canonical motifs, equivalent to alpha-helices and beta-sheets in protein structures. The landscape of these motifs forms the tertiary structure of an interface, which we refer to as the fold. Our goal is classifying interfaces through a deeper understanding of these three hierarchical structures and their interdependence. We also study a variety of naturally defined functions on the interface surface including distance to neighboring proteins and electrostatic potentials. We have demonstrated the biochemical relevance of the interface surface by using it to identify hotspot residues and partition the interface into previously recognized fragments. Additionally, we have established a web-based Protein Interface Database for analyzing and disseminating our data.

L. Ridgway Scott, University of Chicago

Title: Digital biology: Relations between data-mining in biological sequences and physical chemistry

The digital nature of biology is crucial to its functioning as an information system. The hierarchical development of biological components (translating DNA to proteins which form complexes in cells that aggregate to make tissue which form organs in different species) is discrete (or quantized) at each step. It is important to understand what makes proteins bind to other proteins predictably and not in a continuous distribution of places, the way grease forms into blobs.

Data mining is a major technique in bioinformatics. It has been used on both genomic and proteomic data bases with significant success. One key issue in data mining is the type of lens that is used to examine the data. At the simplest level, one can just view the data as sequences of letters in some alphabet. However, it is also possible to view the data in a more sophisticated way using concepts and tools from physical chemistry. We will give illustrations of the latter and also show how data mining (in the PDB) has been used to derive new results in physical chemistry. Thus there is a useful two-way interaction between data mining and physical chemistry.

We will give a detailed description of how data mining in the PDB can give clues to how proteins interact. This work makes precise the notion of hydrophobic interaction in certain cases. It provides an understanding of how molecular recognition and signaling can evolve. We also illustrate how this can be used to understand enzymatic activity. This work also introduces a new model of electrostatics for protein-solvent systems.

This talk is based on joint work with Ariel Ferndandez and Kristina Rogale Plazonic.

Mona Singh, Princeton University

Title: Predicting protein interactions and their interfaces

I will discuss a general structural bioinformatics framework for predicting protein interactions that can be applied to specific structural motifs. Our framework incorporates both sequence and experimental data, and we will demonstrate its effectiveness in predicting coiled-coil protein interactions.

I will also discuss recent hardness results and computational methods for the side-chain positioning problem, an important component of predicting and designing protein structures and interfaces. I will present an integer linear programming formulation of the problem, and then show empirically that, surprisingly, in many interesting cases the linear program relaxation finds optimal solutions to the integer program. Our analysis demonstrates that LP-based approaches are highly effective in finding optimal (and near-optimal) solutions for the side-chain positioning problem.

Ilya N. Shindyalov, University of California at San Diego

Title: Functional Annotation of Proteins with Known Structure by Structure and Sequence Similarity, DNA-protein Interaction Patterns and GO Framework

Following the discovery of an increasing number of proteins with known structures there is the need to provide functional annotation that is both highly accurate and consistent. Two approaches have been developed to address this issue: (i) Using sequence and structural similarity to assign function within the Gene OntologyTM (GO) framework; (ii) Using structural similarity and protein-DNA interaction patterns to assign protein function

(i) GO provides consistent annotation in a computer readable and usable form, therefore GO annotation has been assigned to a large number of protein sequences based on direct experimental evidence and through inference by homology. Here we show that this annotation can be extended and corrected for cases where protein structures are available, but function is not. Specifically, using the CE algorithm for structure comparison we extend the protein annotation currently provided by GOA EBI to further annotate the contents of the Protein Data Bank. Specific cases of biologically interesting annotations derived by this method are provided. Given that the relationship between sequence, structure and function is complicated, we explore the impact of this relationship on assigning GO annotation. The effect of superfolds (folds with many functions) is considered by taking into consideration SCOP classification.

(ii) A fully automated classification of DNA-binding protein domains based on existing 3D-structures from the PDB has been implemented. The classification, by domain, relies on Protein Domain Parser (PDP) and the Combinatorial (CE) algorithm for structural alignment. The approach involves the analysis of 3D-interaction patterns in DNA-protein interfaces, assignment of structural domains interacting with DNA, clustering of domains based on structural similarity and DNA-interacting patterns. Comparison with existing resources on describing structural and functional classifications of DNA-binding proteins was used to validate and improve the approach proposed here. In the course of our study we defined a set of criteria and heuristics allowing us to automatically build a biologically meaningful classification and define classes of functionally related protein domains. It was shown that taking into consideration interactions between protein domains and DNA considerably improves the classification accuracy. Our approach provides a high-throughput and annotation of DNA-binding protein families.

The results and additional information on both (i) and (ii) can be found at http://spdc.sdsc.edu.

Vladimir Sobolev, Weizmann Institute of Science, Israel

Title: An approach to semi flexible docking: a case study of the enzymatic reaction catalyzed by terpenoid cyclases

A method is described for semi flexible molecular docking, when part of the small molecule is fixed within the protein matrix. The algorithm is based on the assumption that a part of the ligand molecule (at least 2 atoms) is defined as fixed in a predetermined position within the protein matrix. The approach is applied to model the first two reaction steps in the formation of 5-epi aristolochene by 5-epi-aristolochene cyclase from Nicotiana tabacum.

Zoltan Szabadka, Eotvos University, Hungary

Title: High throughput processing of the structural information of the Protein Data Bank

The Protein Data Bank (PDB) is the largest, most comprehensive, freely available depository of protein structural information, containing more than 30 000 deposited structures. On one hand, the form and the organization of the PDB seems to be perfectly adequate for gathering information from specific protein structures, by using the bibl iographic references and the informative remark fields. On the other hand, however, it seems to be impossible to automatically review remark fields and journal references for processing hundreds or thousands of PDB files.

We present here a family of combinatorial algorithms to solve this problem. Our algorithms are capable to automatically analyze PDB structural information, identify missing atoms, repair chain ID information, and most importantly, the algorithms are capable of identifying ligands and binding sites, facilitating data mining studies of pro ven protein-ligand binding and testing virtual docking algorithms. The algorithm can also be used for predicting flexible sub-chains of the structure.

Sandor Vajda, Boston University

Title: Computational mapping of proteins for exploring the role of binding site plasticity in molecular recognition and information transfer

Computational mapping of proteins moves molecular probes - small organic molecules containing various functional groups . around the protein surface, finds favorable positions, clusters the conformations, and ranks the clusters on the basis of the average free energy (1-3). We employ at least six different probes, retain the five lowest free energy clusters for each probe, and define the positions at which several clusters overlap as the consensus sites. Mapping a number of enzymes has shown that the largest consensus sites (i.e., consensus sites with the highest number of overlapping clusters) are almost always localized to major subsites of the enzyme active site, and as a result, the amino acid residues that interact with the probes also bind the specific ligands of the protein (3). Thus, the method can provide detailed and reliable information on important amino acid residues in the binding site. It is important to note that the mapping finds the active site even when it is not the largest pocket (3), and in this sense it performs better than binding site identification methods based on purely geometric criteria.

In this talk we describe the application of solvent mapping to two classes of proteins, which are the peroxisome proliferator activated receptor-...PPAR....representing nuclear receptors, and cytochromes P450. Results provide fundamental new insight on the mechanism of information processing by these important molecules, and show that the mapping can be used as a powerful tool of analysis to understand the role of relatively small but highly cooperative conformational changes in molecular recognition.

References

Dennis, S., Kortvelyesi, T. & Vajda, S. (2002). Computational mapping identifies the binding sites of organic solvents on proteins. Proc. Natl. Acad. Sci. USA 99, 4290-4295.
Kortvelyesi, T., Dennis, S., Silberstein, M., Brown, L. III & Vajda, S. (2003). Algorithms for computational solvent mapping of proteins. Proteins 51, 340-351.
Silberstein, M., Dennis, S., Brown III, L., Kortvelyesi, T., Clodfelter, K. & Vajda, S. (2003) Identification of substrate binding sites in enzymes by computational solvent mapping, J. Molec. Biol. 332, 1095-1113.
Sheu, S-H., Kaya, T., Waxman, D. J. & Vajda, S. (2005). Exploring the binding site structure of the ..PPAR. ligand binding domain by computational solvent mapping. Biochemistry, 44, 1193-1209.

Zhiping Weng, Boston University

Title: Protein-protein Docking Algorithm Development and Testing

Protein-protein docking is a important approach to study protein-protein interactions. I will describe the efforts in my lab on developing docking algorithms ZDOCK and RDOCK, a docking benchmark, and the testing of the algorithms on the benchmark as well as through the community-wide blind test CAPRI. I will also discuss analysis on transient and obligatory complexes.

Ned Wingreen, Princeton University

Title: Modeling the chemosensing system of E. coli

The chemotaxis network in E. coli is the best studied = signal-transduction network of any living organism. The network enables = E. coli to swim toward attractants such as amino acids or sugars, and = away from repellents. The cells perform chemotaxis by detecting = temporal changes in their chemical environment and transducing this = information into a decision to swim straight or change direction = (tumble). The system is remarkable for its high sensitivity and robust = adaptation over a wide range of external chemical concentrations. = Motivated by recent in vivo FRET studies [1], we model the chemosensing = system as a mixed array of interacting receptors akin to an Ising = lattice. Our results support and extend the robust-adaptation model of = Barkai and Leibler [2], in which receptor complexes function as = two-level systems.

[1] V Sourjik and HC Berg, PNAS 99(1), 123-127 (2002); Ibid, Nature = 428(6981), 437-441 (2004).

[2] N Barkai and S Leibler, Nature 387(6636), 913-917(1997).

Jinbo Xu, MIT

Title: Fast and Accurate Protein Side-Chain Packing

If we know the primary sequence of a protein, can we predict its three-dimensional structure by computational approaches? This is one of the most important and difficult problems in computational molecular biology and has tremendous implications for the field of proteomics.

In this talk, we will focus on protein side-chain packing, one key problem of protein structure prediction. This problem has been proved to be NP-hard and hard to approximate. We will describe a novel approach to the protein side-chain packing problem, based on the tree-decomposition of a protein structure. This algorithm not only runs very efficiently in practice, but also yields a polynomial-time approximation scheme for some types of objective functions.

We have implemented the proposed algorithm combined with several other components as one protein side-chain prediction program TreePack. TreePack achieves an average speed five times faster than SCWRL 3.0, a widely-used side-chain packing program, and runs up to 90 times faster on large proteins. TreePack not only runs very efficiently, but also has the same accuracy as SCWRL.

Ying Xu, University of Georgia

Title: Multi-Scale Structure Prediction of Helical Transmembrane Proteins

As the first step toward a multi-scale, hierarchical computational approach for membrane proteins structure prediction, the packing of transmembrane (TM) helices was modeled at the residual and atomistic levels respectively. For predictions at the residual level, the helix-helix and helix-lipid interactions were described by a set of knowledge-based energy functions. For predictions at the atomistic level, CHARMM19 force field was employed. To facilitate the system to overcome energy barriers, Wang-Landau sampling was carried out by performing a random walk in the energy and conformational spaces. Native-like structures were predicted at both levels for 2- and 7-helix systems. Interestingly, consistent results were obtained from simulations at residual and atomistic levels for the same system, strongly suggesting the feasibility of a hierarchical approach for membrane structure prediction.

Previous: Program

Workshop Index

DIMACS Homepage

Contacting the Center
Document last modified on May 27, 2005.