Cornell NSF-Census Research Node: Integrated Research Support, Training and Data Documentation
John M. Abowd, William Block, Ping Li, Lars Vilhuber
Warren Brown, Carl Lagoze
The era of public-use micro-datasets as a cornerstone of empirical research in the social sciences is coming to an end. While it is still feasible to create such data without breaching confidentiality, scholars are pursuing research programs that require inherently identifiable data, such as geospatial relations, exact genome data, networks of all sorts, and linked administrative records. These researchers acquire authorized restricted access to the confidential identifiable data and perform their analyses in secure environments. They may publish results that have been filtered through a statistical disclosure limitation protocol. Scientific scrutiny is hampered because researchers cannot effectively implement a data-management plan that permits sharing these restricted-access data with other scholars. This data-custody problem is impeding the "acquire, archive, and curate" model that dominated social science data preservation in the era of public-use micro-data. This project will bridge the transition to restricted-access data and offer the scholar, the scientific community, and the custodial agency a feasible path to long-term data preservation. The Comprehensive Census Bureau Metadata Repository (CCBMR) will be a Data Documentation Initiative-based curation system designed and implemented in a manner that permits synchronization between the public and confidential versions of the repository. The scholarly community will use the CCBMR as it would a conventional metadata repository; only the values of certain confidential items will be withheld, not the metadata that describe them. The authorized user, working on the secure Census Bureau network, will use the CCBMR with full information in authorized domains. There is no duplication of effort, and the project will implement fully automatic disclosure avoidance review of the metadata where feasible.
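The relationship between the confidential and public versions of a metadata record can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the `confidential` flag, the `SUPPRESSED` placeholder, and the sample values are hypothetical, and the record layout does not follow the Data Documentation Initiative schema or any actual CCBMR design. It shows only the idea that a public view keeps every metadata field while masking the values flagged as confidential.

```python
from copy import deepcopy

# Hypothetical placeholder written in place of a withheld value.
SUPPRESSED = "[suppressed]"

# Hypothetical confidential-side record for one variable. Field names and
# values are illustrative only and do not follow the DDI schema.
confidential_record = {
    "name": "earnings_q",
    "label": "Quarterly earnings",
    "type": "numeric",
    "summary": {
        "min": {"value": 0, "confidential": False},
        "max": {"value": 1250000, "confidential": True},
        "n_missing": {"value": 42, "confidential": True},
    },
}


def public_view(record):
    """Return a copy in which confidential values are masked.

    Every field of the record is kept, so the public version still
    documents which statistics exist; only flagged values are replaced.
    """
    public = deepcopy(record)  # leave the confidential record untouched
    for stat in public["summary"].values():
        if stat["confidential"]:
            stat["value"] = SUPPRESSED
    return public


public_record = public_view(confidential_record)
```

Deriving the public record from the confidential one, rather than maintaining two records by hand, is one way the two versions could stay synchronized; an automated disclosure avoidance review would take the place of the simple flag check used here.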
The preservation function operates indefinitely on the original scientific inputs as long as the researchers cooperate and the agency continues to fund the preservation component of the CCBMR. Doctoral students will be taught how to develop research programs using restricted-access Census Bureau data and the repository tools developed in this project in combination with previously developed tools. The same tools will be used to develop computational statistics algorithms based on boosting to improve the integration, editing, and imputation models that assemble the micro-data used for the Census Bureau's longitudinally linked employer-employee database.
Because the Confidential Information Protection and Statistical Efficiency Act of 2002 formalized the obligation of every statistical agency in the United States to take long-term custody of the confidential micro-data used for its work, all federal statistical agencies face the same problem as the Census Bureau. The CCBMR, the education based on this repository, and the collaborative computational statistics model can all be generalized to meet the restricted-access research requirements of other statistical agencies. These tools allow statistical agencies to harness the efforts of researchers who want to understand the structure and complexity of the confidential data they intend to analyze so that they can propose projects and produce reproducible scientific results. Future generations of scientists will be able to build on those efforts because the long-term data preservation in the CCBMR will operate on the original scientific inputs, not inputs that have been subjected to statistical disclosure limitation prior to entering the repository. This curation will result in a viable system for enforcing data management plans on projects, ensuring that results can be tested and replicated by future scientists. This activity is supported by the NSF-Census Research Network funding opportunity.