CDI: Integrating Statistical and Computational Approaches to Privacy
Data privacy is a fundamental problem of the modern information infrastructure. Increasing volumes of personal and sensitive data are collected and archived by health networks, government agencies, search engines, social networking websites, and other organizations. The social benefits of analyzing these databases are significant. At the same time, the release of information from sensitive data repositories can be devastating to the privacy of individuals and organizations. The challenge is to discover and release important characteristics of these databases without compromising the privacy of those whose data they contain. The main goal of this project is to design scalable computational techniques that are statistically sound, yield broadly useful data, and yet preserve privacy in the face of realistic external information. The project aims to integrate two essentially different approaches to the complex problem of data privacy. The reconciliation of these approaches raises a number of fundamental questions for statistical theory and cryptography, as well as methodological challenges that must be overcome to enable practical applications. This research centers on three themes: (1) Integrating the computationally focused, rigorous definitions of privacy emanating from computer science with notions of utility from statistics. (2) Developing cryptographic protocols for distributing privacy-preserving algorithms for valid statistical analysis among a group of servers so as to avoid pooling data in any single location. (3) Understanding the practical potential of the developed techniques by applying them to concrete problems in the behavioral and social sciences and analyzing important data sources from the official statistical community. The research will be carried out in collaboration with social scientists and industry researchers.
The project will increase awareness of data privacy issues and promote research on statistical disclosure limitation, cryptography and privacy-preserving data mining. Moreover, this research will transform the way statistical agencies, social scientists, medical researchers, and those in industry approach privacy, in particular how they collect, share and publish information. The integration of statistical and cryptographic methods in the form of ex ante provably secure procedures will provide the essential scientific fundamentals for official statistical agencies to fulfill their mission of useful data production, which the proliferation of digital information has endangered. Finally, the new techniques will permit opening the vault of industrial data, such as search logs and data on social networks, to statistical analysis, greatly expanding the research domain of the social and health sciences.