Lars Vilhuber and John Abowd
The Virtual Research Data Center (VirtualRDC) at Cornell University has been a successful research support tool for users of many of the Census Bureau's large-scale confidential data products, including, but not limited to, those that are accessible via the Census Research Data Center network. Over 200 computational users and 600 download users have benefited from the VirtualRDC resources. Their scientific publications cite the NSF grants that supported the development of the VirtualRDC. The proposed activity seeks to keep this support network flourishing. In addition, most social science researchers face substantial hurdles when they wish to harness the power of large-scale computational clusters, in particular when using new, very large synthetic data sets with their unprecedented detail on people, jobs, and firms. The proposed activity seeks to extend the VirtualRDC model to support tera-scale social science computing via the NSF-sponsored TeraGrid resources. The statistical software packages most widely used by social scientists, i.e., SAS, Stata, and SPSS, are not available on the TeraGrid itself or on any of the servers at the borders of the TeraGrid with fast connections to it. Viewed through the lens of the typical data-driven research process (extract, edit, and transform data; transfer data to a computational location; and perform analysis), social science researchers are typically constrained in at least one of these steps when approaching the high-performance computing clusters on the TeraGrid. For most data preparation, and for much analysis, the lack of standard statistical analysis and data preparation software packages is a serious impediment. However, the typical social scientist's workstation or university-provided computational infrastructure does not have the resources to handle these very large data sets.
Furthermore, neither the social science workstation nor the university-provided infrastructure has sufficiently fast data connectivity to transfer large prepared data files to the TeraGrid for processing there. This project aims to remedy the bottlenecks in the first and second steps through a focused expansion of resources at a critical location, resulting in a highly useful gateway to the TeraGrid for the social sciences. The project builds a social science TeraGrid gateway that allows researchers (i) to perform the data preparation step using the software packages they are comfortable with, speeding up the data preparation phase, and (ii) to do so on servers that have a fast connection to the TeraGrid, greatly speeding up the data-transfer process. The third bottleneck, the absence of social statistics packages on the TeraGrid, is not addressed by this proposal, since it would require resources, in particular licensing resources, an order of magnitude larger than our proposed budget. This step is left to future proposals.