Title: Moving Away from G(n,p)
The most popular model of random networks, G(n,p), was introduced in 1959 by Erdos and Renyi: given a set of n vertices, include each possible edge independently with probability p. The main attraction of this model is its extreme technical simplicity, stemming from the perfect independence between edges. Unfortunately, this simplicity is tantamount to a lack of geometry: for example, the shortest path metric of G(n,p) is often used to provide dimension lower bounds for embedding into normed spaces. Moreover, the growth of many real networks, rather than being monotone in ``time", reflects the unfolding of an optimization tradeoff.
We would like to build models of random networks that reflect these issues. And we want to do so without ``building-in" emergent properties, e.g., by specifying the degree distribution. So, we propose studying random graphs whose evolution is influenced by simple, but globally-acting constraints. We will present (mostly experimental) results for two concrete such notions: i) a miniscule twist on G(n,p), motivated by the evolution of chromosomes, which seems to make a big difference and, more ambitiously, ii) random graphs with constrained total wire (edge) length.
Joint work with Ricardo Menchaca-Mendez.
Title: Local Graph Partitioning using Pagerank Vectors
A local graph partitioning algorithm finds a cut near a specified starting vertex, with a running time that depends largely on the size of the small side of the cut, rather than the size of the input graph. We give a local partitioning algorithm using a variation of PageRank with a specified starting distribution. We derive a mixing result for PageRank vectors similar to that for random walks, and show that the ordering of the vertices produced by a PageRank vector reveals a cut with small conductance. In particular, we show that for any set $C$ with conductance $\Phi$ and volume $k$, a PageRank vector with a certain starting distribution can be used to produce a set with conductance $O(\sqrt{\Phi \log k})$. We present an improved algorithm for computing approximate PageRank vectors, which allows us to find such a set in time proportional to its size. By combining small sets found by this local partitioning algorithm, we obtain a cut with conductance $\phi$ and approximately optimal balance in time $O(m \log^4 m/\phi^2)$.
Joint work with Fan Chung and Kevin Lang
Title: Solving Massive Graph Problems using Petascale Computing
Graph theoretic problems are representative of fundamental kernels in traditional and emerging computational sciences such as chemistry, biology, and medicine, as well as applications in national security. Yet they pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. Few parallel graph algorithms outperform their best sequential implementation due to long memory latencies and high synchronization costs. In this talk, we consider several graph theoretic kernels for connectivity and centrality and discuss how the features of petascale architectures will affect algorithm development, ease of programming, performance, and scalability.
Title: Network Inference and Analysis for Systems Biology
Genome sequencing has become easy, but understanding how genes and proteins assemble into a network remains hard. I will discuss our group's work on this problem. We have been developing data-mining algorithms for inferring gene modules from high-throughput genetic screens. The algorithms work on generic multi-partite graph structures and should be applicable to a wide range of problems. In a second approach, we have been developing a computational scheme to compute gene regulatory network wiring diagrams directly from DNA and protein sequence using all-atom simulations. Finally, I will describe an extension of capture-recapture statistics that we have developed to estimate false-positive and false-negative rates for network edges (biological interactions) revealed by noisy high-throughput experiments.
Title: Activating interactions and the dynamics of biological networks
Many studies of biological networks have analyzed network `wiring diagrams' (topological features). Such work has suggested that specific types of network structure may increase network robustness and therefore confer a selective advantage. However, knowledge of network topological features does not allow one to predict network dynamical behavior---for example, whether deleting a protein from a signaling network would maintain steady-state behavior, or induce oscillations or chaos. In this talk I show that the balance between activating and inhibiting connections is important in determining whether a network reaches steady state or oscillates. Using a simplified mathematical model that captures the essential behavior of a network of interacting genes or proteins, we studied randomly sampled networks, networks subjected to selection for specific properties, and topologies of real biological networks. In all cases, the ratio of activating to inhibiting connections affects whether the network reaches steady state or oscillates. Therefore, the fraction of activating connections may be a control parameter that cells use to predispose a network to oscillate or reach steady state.
Title: Depletion of Feedback Loops in Large Scale Biological Networks
The local structure of biological and man-made complex systems can be studied by enumerating network motifs. Since searching for large size arbitrary motifs is cumbersome, we limit our focus on cycles and utilize a parallelized program to search for large-size cycles and characterize their statistical properties in different networks. Studying biological complex systems and man-made complex systems abstracted to directed networks, we will show that the statistics of directional correlations (i.e. links coming in and out of each node) can be captured by a simple physical model, a Pott's spin system. Most of the networks display strong negative correlations in the directions of links along cycles, and are locally described as ``anti-ferromagnets'', i.e. nodes tend to be either sources of sinks of directional links. However, this negative correlation is determined not only by local properties, but also by global ones, as architectural randomizations tend to be more positively correlated than the original networks. We will also present an evolutive algorithm that recreates most of these novel topological properties, along with traditional ones, and emphasizes the interplay of local and global interactions in the determination of directionality. Finally, we will show that this particular topological feature has a dramatic impact on the dynamics of the networks, as it tends to minimize dynamical interference of signals reverberating across nodes, and in particular significantly reduces the number of feedback loops.
Title: Self-Assembly Networks
We consider a file distributed over a network in which each node, except the root node, has at most some small segment of the file; the root is responsible for keeping the entire file. When a client at some node v requests the entire file, the file segment at v begins downloading immediately to the client, and, at the same time, the client's request is forwarded to all of v's neighbors, which in turn send their segments down to v and recursively continue the process throughout the network. When set up correctly, segments making up the remainder of the file will arrive seamlessly at the client's node. We discuss the interesting modeling and algorithmic questions posed by the design of such systems. Our model entails a biomolecular metaphor. The analysis focuses on tree networks.
Title: Where do Power Laws Come From?
What distribution of graphical degree sequence is invariant under "scaling"? Are these graphs always power law graphs? We show the answer is a surprising "yes" for sparse graphs if we ignore the isolated vertices or more generally the vertices with degrees less than a fixed constant k. We obtain a concentration result on the degree sequence of a random induced subgraph. The case of hypergraphs (or set-systems) has also been examined.
Title: The Architecture of Robustness
This talk focuses on architectural and organizational principles of networked systems, building on insights about the fundamental nature of complex biological and technological networks drawn from three converging research themes.
1) With molecular biologys detailed description of components and growing attention to systems biology the organizational principles of biological networks are becoming increasingly apparent.
2) Advanced technologys complexity is now approaching biologys. While the components differ, there is striking convergence at the network level of architecture and the role of layering, protocols, and feedback control in structuring complex multiscale modularity. New theories of the Internet and related networking technologies have led to test and deployment of new protocols for high performance networking (www.hot.caltech.edu, netlab.caltech.edu).
3) A new mathematical framework for the study of complex networks suggests that this apparent network-level evolutionary convergence within/between biology/technology is not accidental, but follows necessarily from the universal system requirements to be efficient, adaptive, evolvable, and robust to perturbations in their environment and component parts. (www.cds.caltech.edu/sostools, www.cds.caltech.edu/~doyle)
Title: The Optimization Origins of Preferential Attachment
We show how the mechanism of preferential attachment can emerge from an underlying network optimization framework. The preferential attachment (PA) model so obtained has two novel features, saturation and viability, which have natural interpretations in the underlying network. Like PA, saturation has previously been assumed at an axiomatic level. The combination of PA and saturation leads to power-law degree distributions with exponential cutoff which give excellent fits to a broad range of empirical observations of networks. Here we show how a simple underlying optimization framework can give rise to both known mechanisms and likewise to a new concept of viability, and suggest that such models form a good starting point for the analysis of many networks. In addition we discuss the fit provided to a broad range of data, including previously unexplained data on the Internet obtained from "whois" tables.
This is joint work with Jennifer Chayes, Christian Borgs, Noam Berger and Bobby Kleinberg.
Title: Trouble with Web Matrices and Pagerank
We shall explain the basic PageRank algorithm. In our experience, the less "damping" used, the more "unstable" the PageRank vector is. Attempting to understand this points out a lot of perplexing properties of web matrices.
First, this instability is not present in all Markov chains. Secondly, it is not always easy to link this instability to power laws or other properties of web matrices that we know. Specifically, if A is a Markov matrix and v is a probability vector (e.g., the "teleportation"), as i increases, A^i v seems to have erratic properties, including the fact that websites with a lot of A^i v weight for some i values may be insignificant for larger or smaller values of i. Also, it is not clear how the "instability" is affected by the choice of normalization. Studying these problems suggests new notions in numerical analysis, such as a "cross condition matrix" for eigenvalue stability.
This is joint work with Chen Greif.
Title: Recurring Mathematical and Computational Problems in Biology
In my work on several scientific and engineering projects involving the computational modeling of molecular biology data, I have observed a small number of recurring mathematical and computational problems. These problems often involve exponentially large search spaces, probability, machine learning and graph-based methods, so computer scientists, engineers and mathematicians should be able to help solve them. In this talk, I will describe these recurring problems, explain why they are important in biology, and offer some insights into how to proceed in finding solutions.
Title: Designing Threshold Networks with Given Structural and Dynamical Properties
Threshold networks contain a highly-compressible layered structure and allow fast computation of many network properties such as degree distribution, clustering, and degree correlation. We show how to construct arbitrarily large, sparse, threshold networks with (approximately) any prescribed degree distribution or Laplacian spectrum. Control of the spectrum allows careful study of synchronization properties of threshold networks including the relationship between heterogeneous degrees and resistance to synchrony.
Title: PROCTOR: An algorithm for reconstructing the internal interaction topology of protein complexes
Recent advances in high-throughput experimental techniques have given us a wealth of protein interaction data, rich in quantity and variety. While the quantity and variety of data present special challenges for modeling, they also present unique opportunities for gaining insight into the proteome by allowing us to leverage multiple perspectives. We present PROCTOR (PROtein Complex TOpology Reconstruction), an algorithm for inferring the internal interaction topology of protein complexes from both direct protein- protein interaction data (e.g., yeast two-hybrid: Y2H) and from protein co-complex data (e.g., affinity purification-mass spectrometry: AP-MS). While other methods have attempted to use both these kinds of data, they usually require that co-complex data first be transformed into pairwise protein-protein interaction data under a 'spoke' or 'clique' model, a transformation we do not require because we learn the hidden internal topology.
We evaluate PROCTOR by combining all available data from eight high- throughput proteomic datasets, resulting in coverage of over 90% of the yeast proteome. PROCTOR exhibits significantly better sensitivity and specificity than other algorithms for predicting domain-domain interactions from Y2H and AP-MS data. Additionally, it provides us with an improved view of the global protein-protein interaction network in yeast by reducing both the false positive and false negative error rates of the observed data. Finally, PROCTOR elucidates the topology of AP-MS purifications, both recovering known complexes (e.g., Arp2/3 and RNA polymerase II) and suggesting new complexes. In doing so, PROCTOR provides us with a global view of how proteins interact with each other via their domains to form macromolecular complexes.
Title: Optimization Problems in Social Networks
A social network - the graph of relationships and interactions within a group of individuals - plays a fundamental role as a medium for the spread of information, ideas, influence, or diseases among its members. An idea or innovation will appear, and it can either die out quickly or make significant inroads into the population. Similarly, an infectious disease may either affect a large share of the population, or be confined to a small fraction.
The collective behavior of individuals and the spread of diseases in a social network have a long history of study in sociology and epidemiology. In this talk, we will investigate graph-theoretic optimization problems relating to the spread of information or diseases. Specifically, we will focus on two types of questions: influence maximization, wherein we seek to identify influential individuals to start a cascade of an innovation to maximize the expected number of eventual adopters; and infection minimization, wherein we seek to remove nodes so as to keep a given infected component small.
We will present constant factor and bicriteria algorithms for versions of these problems, and also touch on open problems and issues regarding competition among multiple innovators.
(This talk represents past joint work with Jon Kleinberg, Eva Tardos, Ara Hayrapetyan, Martin Pal, and Zoya Svitkina, as well as ongoing work with Shishir Bharathi and Mahyar Salek.)
Title: Structure and Evolution of Online Social Networks
Online social networks have become a major phenomenon on the web. We analyze the structure and evolution of the LiveJournal blogspace and the Flickr social network. We then formulate a simple model of social networks that can explain the success of Milgram's famous experiment that gave rise to `six-degrees of separation'.
Title: Partitioning Real-World "Power-Law" Graphs
We discuss some properties of real-world "power-law" graphs that make them challenging inputs for graph partitioning, especially the balance problem, namely that the best quotient cuts and normalized cuts are highly unbalanced.
We exhibit a number of graphs with this problem, and discuss some experiments and measurements that might shed more light on its nature. For example, the graph might have some kind of fractal structure; or it might contain respectable clusters that happen to have a wide range of sizes; or there could be a very large number of tiny nearly disconnected pieces.
We also discuss how the main theoretical and practical graph partitioning algorithms behave on these graphs, and present some ideas for improving them. Algorithms will include Metis, Leighton-Rao, Spectral and SDP-based methods, plus KRV and some KRV-like algorithms that are more practical.
Title: Using Lovasz Local Lemma in the Space of Random Matching
The Lovasz Local Lemma is known to have an extension for cases where independence is missing but negative dependencies are under control. We show that this is often the case for random matchings, and we provide easy-to-check conditions for the non-trivial task of verifying a negative dependency graph for random matchings. This gives an alternative probabilistic tool for random permutations, random injections, random regular graphs, and the configuration model for random graphs of given fixed degree sequence. We will show some applications in these different areas, including existence results for hyper-graph packing and Tur\'an type extremal problems.
Title: Scalable Algorithms for Vector Space Computations in Complex Data Environments
In numerous Internet applications, data with complex interrelationships are modeled by matrices and analyzed with vector space or convex analysis methods. Consider, e.g., term-document matrices or the adjacency of a large graph. A difficulty applying these techniques in large or complex data environments is that traditional algorithms require at least quadratic or cubic time. Recall, e.g., over-constrained least squares approximation (or more generally over-constrained Lp regression) and the related low-rank matrix approximation. Several randomized algorithms for these ubiquitous problems that run qualitatively faster than traditional algorithms are described; applications to learning in Internet data contexts are discussed; and several limitations and extensions of these ideas are described.
Title: Full Web Pageranking on a Laptop
While the enormous scale of the world wide web makes it a vastly interesting and informative artifact, its size quickly becomes a burden, imposing substantially on the types of analyses that can be performed. Even simple analyses like PageRank require dedicated compute clusters to run effectively. In this talk, I will describe some recent experience designing and building a system that supports full-web PageRank analyses on a commodity laptop.
The work brings together algorithmic improvements, structural observations about the web graph, as well as some old fashioned engineering. I will report on a set of PageRanking measurements taken across a 4B page crawl, as performed on my laptop over the course of a day or two. The techniques generalize to full web SVD, SALSA, and loopy belief propagation.
Title: Untangling Biological Networks using Maximum Entropy Priors on Graphs
I will talk about the problem of untangling real biological interaction networks from noisy measurements of these networks. In these networks, nodes represent genes and edges represent a particular type of interaction between a pair of genes. This problem has become increasingly important in the last few years as more and more experimental techniques are being developed to measure different types of interaction networks on a large-scale, and existing techniques are being applied to different organisms. In this talk, I will concentrate on protein-protein interaction (PPI) networks which are undirected networks in which the edges connect genes whose proteins physically interact in the cell. I will present an energy-based model which uses maximum entropy priors on both the true interaction (signal) network and the network of false positive (noise) interactions. Our structural priors score networks in terms of their degree distributions and clustering coefficients. Exact inference in our model is intractable; I will describe an efficient approximate inference algorithm to compute edge appearance posteriors. I will also describe the evaluation of our model on a real-world problem and some surprising results generated out of that evaluation.
Title: Stochastic Models on Networks, Games and Reconstruction
I will survey some recent results in network reconstruction and games on network. Special emphasis will be given to to discussion of the interplay between hard combinatorial constraints and the stochastic nature of the models and the data.
Title: Complex Structures in Complex Networks
A typical approach for investigating the structure of complex networks is to focus on some structural property or feature of interest and construct a measure for detecting and quantifying that particular feature. This has been a successful approach; many types of structure in networks have been discovered in this way. But it is possible for large networks to contain far more complex structural features than those we can easily think of and describe, including patterns and correlations on many different scales. This talk will present some new methods for revealing structural features of networks in an automated fashion that do not require us to know in advance the particular type of structure we are looking for. These methods allow us to perform "exploratory" data analysis, in which the data themselves lead to new hypotheses about network structure.
Title: Towards Topology Aware Networks
Overlay networks are the main design strategy by which we evolve the Internet. We focus on questions of maintaining good topology connectivity and topology awareness by decentralized protocols in unstructured P2P networks.
Title: Web Search and Online Communities
In this talk, I'll cover some emerging trends in web search, with a digression into some non-traditional search problems. I'll also cover some recent results on the dynamics of online community formation. Finally, I'll describe ongoing work at Yahoo at the intersection of web search and community, and will discuss how community and interpersonal relationships will color the way we think about search in the future.
Title: Modeling Biological Systems from Heterogeneous Genomics Data
Broad availability of diverse functional genomic data should enable fast and accurate prediction of protein function, interactions, and regulation. However, these data remain vastly underutilized and have not caused a truly substantial change in our understanding of biology. I will discuss some reasons for such gap between data and understanding in genomics, including computational and biological diversity of the data, high noise levels, and need to involve biology researchers in the analysis. I will also present our resent work in addressing this problem, including context-sensitive prediction of biological networks, integration of gene expression datasets, and development of a functional genomics evaluation framework.
Title: Degree Correlations in Real and Model Networks: Measures, Origin, and Consequences
Recent studies have shown that several graphs representing real systems are characterized by a power law degree distribution, resulting in significant changes in other graph properties such as the average distance between nodes, the size of the giant component and spreading processes. In this work I show that once the degree distribution is specified, the degree correlations between adjacent vertices strongly modulates those graph properties, in some cases resulting in striking differences depending on the type and magnitude of these correlations. To support this claim I review some recent work done in collaboration with other authors about the characterization of degree correlations and their impact on the size of the giant component, percolation problems, spreading processes and computational complexity.
Title: Core-Dense Graphs and Hypergraphs
Core-dense graphs are a common generalization of dense graphs and metrics. We discuss their motivation, definition and extension to hypergraphs. The class of maximum constraint satisfaction problems with r literals per constraint (MAX-r-CSP) corresponding to core-dense hypergraphs has a polynomial-time approximation scheme. This unifies and extends classes of CSP's known to have PTAS's. Our main tools are a method for low-rank tensor approximation and a simple scaling. The algorithm also applies to problems with a constant number of additional global constraints, e.g., maximum weighted bisection.
This is joint work with W. Fernandez de la Vega, R. Kannan and M. Karpinski.
Title: A Geometrical Preferential Attachment Model of Networks
We study a random graph $G$ that combines certain aspects of geometric random graphs and preferential attachment graphs. The vertices of $G$ are $n$ sequentially generated points x1,x2, ...,xn chosen uniformly at random from the unit sphere. After generating xt, we randomly connect it to m points chosen among x1, x2,...,x{t-1} with probability proportional to degree and some function of the distance. We show that G has a power law degree sequence and small diameter. Unlike the preferential attachment graph, this geometric preferential attachment graph has small separators, similar to some experimental observations.
Joint work with Abraham Flaxman and Alan Frieze