### DIMACS Workshop on Recent Work on Differential Privacy across Computer Science

#### October 24 - 26, 2012 DIMACS Center, CoRE Building, Rutgers University

Organizers:
Aaron Roth, University of Pennsylvania, aaroth at cis.upenn.edu
Adam Smith, Pennsylvania State University, asmith at cse.psu.edu
Presented under the auspices of the DIMACS Special Focus on Information Sharing and Dynamic Data Analysis and the DIMACS Special Focus on Cybersecurity.

#### Abstracts:

Kamalika Chaudhuri, UC San Diego

Title: The Price of Privacy: Sample Complexity Bounds for Differentially Private Classification

We study the problem of differentially private classification -- namely, how to learn a predictor or classifier from sensitive training data, while still preserving the privacy of individuals in the data. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? Previous work studied this question in the context of discrete data distributions, and provided an upper bound, and lower bounds for some specific classes. In the first part of this talk, we show a new lower bound that holds for general classification problems which obey certain fairly-loose conditions.

In the second part of this talk, we consider this question in the context of learning infinite hypothesis classes on continuous data distributions. We show that in this case, even for very simple hypothesis classes, any algorithm that uses a finite number of samples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible (as was shown by previous work).

This talk is based on joint work with Daniel Hsu.

Graham Cormode, AT&T Labs

Title: Data-driven concerns in Privacy

Many users are clamouring for privacy techniques so that they can publish, share or sell their data safely. There are many aspects of data that then have to be managed: the data is often sparse, multi-dimensional, and structured. Each of these requires mechanisms that are aware of these features, so that they provide output which is useful and efficient to compute, in addition to providing a privacy guarantee. In this talk, I will outline some recent efforts to generate such practical mechanisms.

Anupam Datta, Carnegie Mellon University

Title: Differentially Private Data Analysis of Social Networks via Restricted Sensitivity

We introduce the notion of restricted sensitivity as an alternative to global and smooth sensitivity to improve accuracy in differentially private data analysis. The definition of restricted sensitivity is similar to that of global sensitivity except that instead of quantifying over all possible datasets, we take advantage of any beliefs about the dataset that a querier may have, to quantify over a restricted class of datasets. Specifically, given a query f and a hypothesis H about the structure of a dataset D, we show generically how to transform f into a new query f_H whose global sensitivity (over all datasets including those that do not satisfy H) matches the restricted sensitivity of the query f. Moreover, if the belief of the querier is correct (i.e., D is in H) then f_H(D) = f(D). If the belief is incorrect, then f_H(D) may be inaccurate.

We demonstrate the usefulness of this notion by considering the task of answering queries regarding social-networks, which we model as a combination of a graph and a labeling of its vertices. In particular, while our generic procedure is computationally inefficient, for the specific definition of H as graphs of bounded degree, we exhibit efficient ways of constructing f_H using different projection-based techniques. We then analyze two important query classes: subgraph counting queries (e.g., number of triangles) and local profile queries (e.g., number of people who know a spy and a computer-scientist who know each other). We demonstrate that the restricted sensitivity of such queries can be significantly lower than their smooth sensitivity. Thus, using restricted sensitivity we can maintain privacy whether or not D is in H, while providing more accurate results in the event that H holds true.

Joint work with Jeremiah Blocki, Avrim Blum, and Or Sheffet at Carnegie Mellon University.

Sharon Goldberg, Boston University

Title: A Workflow for Differentially-Private Graph Synthesis

We present a new workflow for differentially-private publication of graph topologies. First, we produce differentially private measurements of interesting graph statistics using our new version of the PINQ programming language, Weighted PINQ, which is based on a generalization of differential privacy to weighted sets. Next, we show how to generate graphs that fit any set of measured graph statistics, even if they are inconsistent (due to noise), or if they are only indirectly related to actual statistics that we want our synthetic graph to preserve. We do this by combining the answers to Weighted PINQ queries with a probabilistic inference technique that allows us to synthesize graphs where the statistic of interest aligns with that of the protected graph. I will show how to cast a few graph statistics as queries in Weighted PINQ, and then present experimental results of synthetic graphs generated from the answers to these queries.

Joint work with Davide Proserpio (BU) and Frank McSherry (MSR).

Title: New Results in Differentially Private Empirical Risk Minimization

We discuss differentially private algorithms for convex empirical risk minimization (ERM) (a special class of learning problems). I will present some recent work on private ERM in the "classical" setting (where the number of data samples (n) exceeds the dimensionality (p) of the ERM problem). After reviewing existing results on private ERM, namely, output perturbation [RBHT09, CMS11] and objective perturbation [CM08,CMS11], I will discuss three improvements to the objective perturbation algorithm of CMS'': i) Allowing the use of non-differentiable convex regularizer and convex constraints in the optimization, ii) Use of Gaussian perturbation to improve the dependence of the error on dimensionality (p), iii) Showing tighter utility guarantees under local strong convexity of the risk function.

Finally, I will discuss a new approach for private ERM using online convex programming and compare its guarantees with both output and objective perturbation.

Joint work with Prateek Jain [MSR, India], Daniel Kifer [Penn State], Pravesh Kothari [University of Texas, Austin] and Adam Smith [Penn State].

Andreas Haeberlen, University of Pennsylvania

Title: Differentially Private Join Queries over Distributed Databases

In this talk, I will discuss the problem of answering queries about private data that is spread across multiple different databases. For instance, a medical researcher may want to study a possible correlation between travel patterns and certain types of illnesses. The necessary information exists today -- e.g., in airline reservation systems and hospital records -- but it is maintained by two separate companies who are prevented by law from sharing this information with each other, or with a third party. This separation prevents the processing of such queries, even if the final answer, e.g., a correlation coefficient, would be safe to release.

I will describe a system called DJoin that can process such distributed queries and can give strong differential privacy guarantees on the result. DJoin can support many SQL-style queries, including joins of databases maintained by different entities, as long as they can be expressed using DJoin's two novel primitives: BN-PSI-CA, a differentially private form of private set intersection cardinality, and DCR, a multi-party combination operator that can aggregate noised cardinalities without compounding the individual noise terms.

Moritz Hardt, IBM Research

Title: Incoherence and privacy in spectral analysis of data

Matrix incoherence is a frequently observed property of large real-world matrices. Intuitively, the coherence of a matrix is low if the singular vectors of the matrix bear little resemblance to the individual rows of the matrix. We show that this property is quite useful in the design of differentially private approximate singular vector computation and low-rank approximation.

Our algorithms for these tasks turn out to be significantly more accurate under a low coherence assumption than what a well known worst-case lower bound would suggest. While not straightforward to analyze, our algorithms are highly efficient and easy to implement.

Based on joint works with Aaron Roth.

Michael Hay, Colgate University

Title: Differentially Private Join Queries over Distributed Databases

This talk describes iReduct, a differentially private algorithm for computing answers with reduced relative errors. The basic idea of iReduct is to allocate different amounts of noise to different query results, so that smaller values are injected with less noise than large values (subject to overall privacy budget constraints). The algorithm is based on a novel resampling technique that employs correlated noise to refine existing noisy measurements. We conduct experiments on answering low-order marginals, and find the technique achieves lower relative error than other strategies based on Laplace mechanism. This work appeared at SIGMOD 2011.

Shiva Kasiviswanathan, GE Research

Title: The Power of Linear Reconstruction Attacks

We consider the power of linear reconstruction attacks in statistical data privacy, showing that they can be applied to a much wider range of settings than previously understood. Consider a database curator who manages a database of sensitive information but wants to release statistics about how a sensitive attribute (say, disease) in the database relates to some nonsensitive attributes (e.g., postal code, age, gender, etc). This setting is widely considered in the literature, partly since it arises with medical data. We show one can mount linear reconstruction attacks based on any release that gives: a) the fraction of records that satisfy a given non-degenerate boolean function. Such releases include contingency tables as well as more complex outputs like the error rate of certain classifiers such as decision trees; b) any one of a large class of M-estimators including the standard estimators for linear and logistic regression.

We make two contributions: first, we show how these types of releases can be transformed into a linear format, making them amenable to existing polynomial-time reconstruction algorithms. Second, we show how to analyze the resulting attacks under various distributional assumptions on the data.

Joint work with Mark Rudelson and Adam Smith.

Katrina Ligett, Caltech

Title: A Simple and Practical Algorithm for Differentially Private Data Release

In this talk, we present a new algorithm for differentially private data release, based on a simple combination of the Exponential Mechanism with the Multiplicative Weights update rule. Our algorithm achieves what are the best known and nearly optimal theoretical guarantees, while at the same time being simple to implement and experimentally more accurate on actual data sets than existing techniques.

Joint work with Moritz Hardt and Frank McSherry. To appear in NIPS 2012.

Ashwin Machanavajjhala, Duke University

Title: Pufferfish: A Semantic Approach to the Privacy of Correlated Data

Tremendous amounts of personal data about individuals are being collected and shared online. Legal requirements and an increase in public awareness due to egregious breaches of individual privacy have made data privacy an important field of research. Recent research, culminating in the development of a powerful notion called differential privacy, have transformed this field from a black art into a rigorous mathematical discipline. One of the attractive features of differential privacy is that it is purportedly agnostic to the auxiliary information an adversary can use to breach privacy of individuals.

This talk describes how differentially private algorithms may not guarantee privacy in situations when data is correlated -- e.g.,when there are prior releases of exact counts, or in social networks. We present Pufferfish, an alternate rigorous semantic approach to defining privacy, which explicitly defines which information is kept secret, what adversaries are tolerated, and bounds the information disclosed about each secret to each adversary. We prove that differential privacy is equivalent to a specific instantiation of this semantic framework, wherein a mechanism satisfies differential privacy if and only if it ensures bounded information disclosure to a set of adversaries who believe the individuals in the data are independent of each other. Finally, I will conclude with ongoing work on developing private mechanisms for correlated data. The talk is based on joint work with Daniel Kifer and Bolin Ding.

Ilya Mironov, Microsoft Research

Title: Friends don't let friends use floating point arithmetic.

We describe several challenges, real and hypothetical, of using floating-point arithmetic in implementations of differentially private systems. In the talk we will recall basic principles of floating-point arithmetic, limitations of its abstraction, and specific examples of differentially private functionalities over the reals that go astray if implemented without proper attention to floating-point arithmetic. We will discuss problems with a textbook implementation of the Laplacian mechanism, and ways of fixing it.

Aleksandar Nikolov, Rutgers University

Title: The Geometry of Differential Privacy: the Sparse Case

We study the problem of answering a set of d non-adaptively chosen linear queries to a database of size n represented by a histogram x over a universe of size N under the constraints of exact and approximate differential privacy. In matrix notation, this corresponds to computing the linear map Ax where A is a d by N matrix and the L_1 norm of x is at most n. This problem encapsulates many classes of queries of interest, for example contingency tables and range counting. When the database size n is less then d, there exist significantly more accurate differentially private algorithms than in the case of unbounded n. We design polynomial time (epsilon, delta) and (epsilon,0)-differentially private algorithms that, for any given A and any n, achieve nearly optimal mean squared error.

Our algorithms first compute noisy answers by adding noise sampled from a correlated gaussian distribution chosen independently of the database, and then employ least squares estimation to reduce the error. We prove optimality on (epsilon, delta)-differential privacy using discrepancy lower bounds on blatant non-privacy, leveraging results in asymptotic convex geometry by Vershynin, and Bourgain and Tzafriri. We use ideas from sparse linear regression in analyzing least squares estimation.

Joint work with Kunal Talwar and Li Zhang.

Rafael Pass, Cornell University

Title: Crowd-Blending Privacy

We present a new definition of privacy called crowd-blending privacy that strictly relaxes the notion of differential privacy, and is based on the notion of "blending in a crowd". We demonstrate crowd-blending private mechanisms for histograms and for releasing synthetic data points, achieving strictly better utility than what is possible using differentially private mechanisms. Additionally, we demonstrate that if a crowd-blending private mechanism is combined with a pre-sampling'' step, where the individuals in the database are randomly drawn from some underlying population (as is often the case during data collection), then the combined mechanism satisfies not only differential privacy, but also the stronger notion of zero-knowledge privacy. This holds even if the pre-sampling is slightly biased and an adversary knows whether certain individuals were sampled or not. Taken together, our results yield a practical approach for collecting and privately releasing data while ensuring higher utility than previous approaches.

Based on joint work with Johannes Gehrke, Michael Hay, and Rafael Pass. (Appeared in Crypto 2012)

Title: Graph Analysis with Node-Level Differential Privacy

We develop algorithms for the private analysis of network data that provide accurate statistics on realistic networks while satisfying stronger privacy guarantees than those of previous work. We present several techniques for designing node differentially private algorithms, that is, algorithms whose output distribution does not change significantly when a node and all its adjacent edges are added to a graph. We also develop methodology for analyzing the accuracy of such algorithms on realistic networks.

The main idea behind our techniques is to project'' (in one of several senses) the input graph onto the set of graphs with maximum degree below a certain threshold. We design projection operators, tailored to specific statistics, that have low sensitivity and preserve information about the original statistic. These operators can be viewed as giving a fractional (low-degree) graph that is a solution to an optimization problem described as a maximum flow instance, linear program, or convex program. In addition, we derive a generic, efficient reduction that allows us to apply any differentially private algorithm for bounded-degree graphs to an arbitrary graph. This reduction is based on analyzing the smooth sensitivity of the naive'' truncation that simply discards vertices of high degree.

Joint work with Shiva Kasiviswanathan, Kobbi Nissim and Adam Smith.

Omer Reingold, Microsoft Research

Title: Privacy, Fairness, and Social Norms

Concerns for privacy in the Information Age are often intermixed with concerns for fairness. For example, a viable risk in Targeted Advertising is that tracking information on users could be used to discriminate against or steer away minority groups. We will discuss a recent work on fairness in classification and its possible connections to privacy. One of the main insights is that true fairness towards a protected group cannot be obtained by "blindness" for the membership in the group. Our individual-based notion of fairness thus promotes what we refer to as "fairness through awareness." Therefore, while the definition and techniques of differential privacy are very relevant to our work, we conclude that privacy does not imply fairness. We will speculate on the reverse direction. Namely, to what extent does fairness imply privacy?

Our definition of fairness puts social norms center stage. In other words, a major insight we follow is that no mathematical definition can replace social debate regarding what fairness should mean in any particular context. We would argue that, similarly, the definition of differential privacy should not be taken as the end of the discussion regarding what privacy should mean but rather its starting point. We will discuss an approach for providing policy makers tools to understand and define privacy.

Based on joint work with Cynthia Dwork, Moritz Hardt, Toni Pitassi, and Rich Zemel as well as ongoing work with Cynthia Dwork, Guy Rothblum and Salil Vadhan

Aaron Roth, University of Pennsylvania

We consider -large games-, which, like commuter traffic or stock investing, are strategic interactions among many players, each of whom has only a small affect on the welfare of others. One might like to design mechanisms in such games to suggest equilibrium play to the participants, but there are two potential concerns. The first is privacy: the computation of an equilibrium may reveal sensitive information about the utility function of one of the agents (i.e. his destination for that days drive, or his stock portfolio). The second concern relates to incentives: it may be beneficial for one of the agents to misreport his utility function to cause the mechanism to select a preferred, purported "equilibrium" of the reported game. We show how differential privacy can be brought to bear to solve both of these problems: we give a privacy preserving mechanism for computing the equilibria of a large game, which in turn implies an approximately truthful equilibrium selection mechanism.

This is joint work with Michael Kearns, Mallesh Pai, and Jon Ullman.

Anand Sarwate, TTI

Title: Near-Optimal Algorithms for Differentially-Private Principal Components

Principle components analysis (PCA) is an important technique for reducing the dimensionality of data, and is often used as a preprocessing step in machine learning algorithms. Although the data may be presented in a large dimension d, they may lie close to a k-dimensional linear subspace. We propose using the exponential mechanism to approximate the top-k PCA subspace. We analyze the sample complexity of this procedure and show that it scales linearly in the data dimension. We prove a lower bound showing that this scaling is essentially tight. We furthermore give a lower bound on the sample complexity for input perturbation (the SULQ method), showing that it scales like d^{3/2}. We show how to implement our procedure using a Markov Chain Monte Carlo approach and explore the privacy-utility tradeoff for this problem on real data.

Or Sheffet, Carnegie Mellon University

Title: The Johnson Lindenstrauss Lemma Itself Preserves Differential Privacy

We show that an "old dog", namely -- the classical Johnson-Lindenstrauss transform, performs new tricks'' -- it gives a novel way of preserving differential privacy. We show that if we take two databases, D and D', such that (i) D'-D is a rank-1 matrix of bounded norm and (ii) all singular values of D and D' are sufficiently large, then multiplying either D or D' with a vector of iid normal Gaussians yields two statistically close distributions in the sense of differential privacy. Furthermore, a small, deterministic and public alteration of the input is enough to assert that all singular values of D are large.

We apply the Johnson-Lindenstrauss transform to the task of approximating cut-queries: the number of edges crossing a (S,\bar S)-cut in a graph. We show that the JL transform allows us topublish a sanitized graph that preserves edge differential privacy (where two graphs are neighbors if they differ on a single edge) while adding only O(|S|/\epsilon) random noise to any given query (w.h.p). Comparing the additive noise of our algorithm to existing algorithms for answering cut-queries in a differentially private manner, we outperform all others on small cuts (|S| = o(n)).

We also apply our technique to the task of estimating the variance of a given matrix in any given direction. The JL transform allows us to publish a sanitized covariance matrix that preserves differential privacy w.r.t bounded changes (each row in the matrix can change by at most a norm-1 vector) while adding random noise of magnitude independent of the size of the matrix (w.h.p). In contrast, existing algorithms introduce an error which depends on the matrix dimensions.

Elaine Shi, University of Maryland

Title: Privacy-preserving Distributed Data Analysis

We consider a privacy-preserving distributed data analysis scenario, where data is distributed among users (or organizations), and users may not wish to entrust their sensitive data to a centralized party such as a cloud service provider. Participating parties are mutually distrustful, and a subset of the parties may be compromised and colluding.

We prove new lower bounds on the utility of distributed data analysis satisfying information theoretic differential privacy. To circumvent such lower bounds in practice, we consider computational differential privacy. We demonstrate new constructions that satisfy computational differential privacy while achieving asymptotically smaller error. Our constructions also offer several properties that will be desirable in a practical deployment scenario, including support for periodic aggregation, no peer-to-peer communication, and fault tolerance.

Dan Suciu, University of Washington

Title: A Theory of Pricing Private Data

Personal data has value to both its owner and to institutions who would like to analyze it. While the goal of privacy mechanisms is to hide the owner's data while releasing to analysts noisy versions of aggregate query results, in many scenarios the common practice is quite different. Consumers are happily providing their private information to Internet companies, in return for some free service, and these companies monetize it without compensating the owners. One study mentions that a unique user is worth $4 to Facebook and$24 to Google.

However, the awareness of the value of the personal data increases, and there exists an increasing drive to compensate the end user for her private information, when that is used in aggregate queries. In this talk we discuss a theoretical framework for assigning prices to noisy query answers. The buyer asks for an aggregate query, and indicates the amount of noise for which he is willing to pay: a query with more noise is cheaper. His payment (or part thereof) is used to compensate the data owners, in proportion to their privacy loss. Our study focuses on the issues that arise when the buyer is allowed to ask an arbitrary number of queries; we characterize situations of arbitrage, and give necessary and sufficient conditions to avoid them.

Joint work with: Chao Li, Daniel Yang Li, Gerome Miklau

Kunal Talwar, Microsoft Research

Title: The Geometry of (eps,delta) Differential Privacy

We study the problem of answering a set of d non-adaptively chosen linear queries to a database represented by a histogram x over a universe of size N under the constraint of approximate differential privacy. In matrix notation, this corresponds to computing the linear map Ax where A is a d by N matrix. This problem encapsulates many classes of queries of interest, for example contingency tables and range counting. Given such an A, what is the minimum mean squared error achievable by any (eps,delta) differentially private mechanism? For the case of (eps,0) differential privacy, [HT10,BDKT12] gave a polylogarithmic approximation to this problem. In this work, we give a polylogarithmic approximation for the (eps,delta) case. Our mechanism is very simple and computes noisy answers by adding noise sampled from a correlated gaussian distribution chosen independently of the database. We prove optimality on (epsilon, delta)-differential privacy using discrepancy lower bounds on blatant non-privacy, leveraging results in asymptotic convex geometry. Our work also shows that for such queries, the gap between (eps,0) and (eps,delta) differential privacy is at most polylogarithmic in N and d. In the process, we also give the first known polylogarithmic approximation algorithm for Hereditary Discrepancy.

Joint work with Aleksandar Nikolov and Li Zhang.

Jonathan Ullman, Harvard University

Title: Answering n^{2+o(1)} Counting Queries with Differential Privacy is Hard

A central problem in differentially private data analysis is how to design efficient algorithms capable of answering large numbers of counting queries of the form "What fraction of individual records in the database satisfy the property q?" We prove that if one-way functions exist, then there is no algorithm that takes as input a database D in ({0,1}^d)^n, and k = O(n^2 log n) arbitrary efficiently computable counting queries, runs in time poly(d,n), and returns an approximate answer to each query, while satisfying differential privacy. We also consider the complexity of answering "simple" counting queries, and make some progress in this direction by showing that the above result holds even when we require that the queries are computable by constant depth (AC0) circuits.

Our result is almost tight in the sense that nearly n^2 counting queries can be answered efficiently while satisfying differntial privacy. Moreover, super-polynomially many queries can be answered in exponential time.

We prove our results by extending the connection between differentially private counting query release and cryptographic traitor-tracing schemes to the setting where the queries are given to the sanitizer as input, and by constructing a traitor-tracing scheme that is secure in this setting.

Title: The Privacy of the Analyst and The Power of the State

We initiate the study of *privacy for the analyst* in differentially private data analysis. That is, not only will we be concerned with ensuring differential privacy for the data (i.e.individuals or customers), which are the usual concern of differential privacy, but we also consider (differential) privacy for the set of queries posed by each data analyst. The goal is to achieve privacy with respect to other analysts, or users of the system.

This problem arises only in the context of *stateful* privacy mechanisms, in which the responses to queries depend on other queries posed (a recent wave of results in the area utilized cleverly coordinated noise and state in order to allow answering privately hugely many queries).

We argue that the problem is real by proving an exponential gap between the number of queries that can be answered (with non-trivial error) by stateless and stateful differentially private mechanisms. We then give a stateful algorithm for differentially private data analysis that also ensures differential privacy for the analyst and can answer exponentially many queries.

Joint work with Cynthia Dwork and Moni Naor.

David Xiao, Laboratoire d'Informatique Algorithmique

Title: Is privacy compatible with truthfulness?

Individuals value their privacy, and therefore are unlikely to reveal their private information without incentives. Working in a game-theoretic framework, we study what incentives are necessary and sufficient in order for individuals to reveal their private information. We aim to build truthful mechanisms: the mechanism should be designed so that it is in each player's interest to reveal their actual private information.

We give a simple and computationally efficient transformation that converts any truthful mechanism with small type space into a differentially private and truthful mechanism, all while preserving the quality of the output. We then present negative results that show that when one explicitly modifies the players' utility functions to take into account their loss in privacy, then even the stringent condition of differential privacy sometimes may not be sufficient to encourage individuals to divulge their private information.

Previous: Program
Workshop Index
DIMACS Homepage
Contacting the Center