Project #4

Synthetic Data User Testing and Dissemination

Principal investigators

Lars Vilhuber and John Abowd


National Science Foundation, Grant SES-1042181

Short description

Researchers throughout the social, behavioral, economic, and health sciences use data to test hypotheses about a wide range of individual and social behaviors, decisions, and outcomes. Government statistical agencies regularly collect data that are extremely valuable for this purpose. However, these data are not made directly available to the research community because the identities of the data providers (respondents) are part of the data themselves. Statistical agencies and the scientific community have therefore been developing methods to make analytically valid and highly detailed data available to researchers while simultaneously protecting individual privacy.

A particularly valuable and sensitive kind of data is linked administrative data, such as the Longitudinal Employer-Household Dynamics (LEHD) data, the Longitudinal Business Database (LBD), and surveys with linked administrative data, such as the Survey of Income and Program Participation (SIPP). These datasets have been constructed with support from statistical agencies and the NSF. The highly detailed nature of these data makes them particularly sensitive, and access to the micro-data remains restricted. One approach to balancing confidentiality protection against access is the generation of synthetic data. The process for generating such data begins by estimating a posterior predictive distribution (PPD) of the to-be-released data given the confidential micro-data. The next step is to draw samples from the PPD to produce the released micro-data. The quality of inferences based on a wide variety of models applied to synthetic and actual data has been inadequately assessed to date because only a limited number of users have had access to both data sources. This kind of assessment needs to be integrated within a quality-feedback loop in order to improve synthetic data and increase the use of the data by the research community. This award facilitates such a feedback loop for synthetic versions of two datasets: the Census Bureau's Survey of Income and Program Participation and the Longitudinal Business Database. The goal is to broaden access to the data, enhance the feedback loop, and provide flexible and secure access to early releases of these synthetic data.

A variety of social scientists from a range of disciplines will be able to use this data access method and will provide detailed input that will guide future improvements in data quality.
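
As a rough illustration of the two-step synthesis process described above (estimate a PPD, then draw from it), the sketch below uses a deliberately simple setting. Everything in it is an assumption made for illustration: the "confidential" data are simulated, the model is a single normal variable with a standard noninformative prior, and the function name synthesize is hypothetical. It is not the model used to produce the synthetic SIPP or LBD, which involve many variables and far richer models.

```python
import random
import statistics


def synthesize(confidential, n_synthetic, rng):
    """Draw one synthetic dataset from the PPD of a simple normal model.

    Assumption: normal likelihood with the noninformative prior
    p(mu, sigma^2) proportional to 1 / sigma^2. Illustrative only.
    """
    n = len(confidential)
    ybar = statistics.fmean(confidential)
    s2 = statistics.variance(confidential)  # sample variance (n - 1 denominator)

    # Step 1: draw the parameters from their posterior given the confidential data.
    # A chi-square variate with n - 1 degrees of freedom is Gamma(shape=(n-1)/2, scale=2).
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)
    sigma2 = (n - 1) * s2 / chi2
    mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)

    # Step 2: draw the to-be-released records from the model at those parameters.
    return [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_synthetic)]


rng = random.Random(0)

# Simulated stand-in for confidential micro-data (hypothetical incomes).
confidential = [rng.gauss(50_000, 12_000) for _ in range(2_000)]
synthetic = synthesize(confidential, len(confidential), rng)

# The kind of check a user with access to both sources could feed back:
# does the same analysis give similar answers on synthetic and actual data?
print("mean (confidential):", round(statistics.fmean(confidential)))
print("mean (synthetic):   ", round(statistics.fmean(synthetic)))
print("stdev (confidential):", round(statistics.stdev(confidential)))
print("stdev (synthetic):   ", round(statistics.stdev(synthetic)))
```

The final comparison is a toy version of the assessment the project aims to make routine: running the same analysis on synthetic and confidential data and reporting where the results diverge.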

Additional Resources