We are currently working on an edX-based online version of the class - stay tuned! In the meantime, previous versions of the course, including video recordings, can be found here.

Course Summary

The course is designed to teach students basic and advanced techniques for acquiring and transforming raw information into social and economic data. The 2013 version is particularly aimed at American Ph.D. students who are interested in using confidential U.S. Census Bureau data, and the confidential data of other American statistical agencies that cooperate with the Census Bureau. We cover the legal, statistical, computing, and social science aspects of the data "production" process.

Major emphasis is placed on U.S. Census Bureau data that are accessible from the Bureau's Research Data Center (RDC) network. Graduate students and faculty who are planning to use RDC-based data, or are seriously considering it, should pay particular attention to the labs related to the proposal process.

The RDC-accessible data products covered in the course include, among others:

  • The internal files used to manage the Census Bureau's household and establishment frames
  • The Longitudinal Employer-Household Dynamics (LEHD) micro data
  • The Longitudinal Business Database (LBD) and its predecessor, the Longitudinal Research Database (LRD)
  • Internal versions of the Survey of Income and Program Participation (SIPP), Current Population Survey (CPS), American Community Survey (ACS), American Housing Survey (AHS), and the 1990, 2000, and 2010 Decennial Censuses of Population and Housing
  • The Employer and Non-employer Business Registers (BR and SSEL)
  • The Censuses and Annual Surveys of Manufactures, Mining, Services, Retail Trade, Wholesale Trade, Construction, Transportation, Communications, and Utilities
  • The Business Expenditures Survey
  • The Characteristics of Business Owners

Students will also be introduced to the NSF-sponsored Virtual Research Data Center.

Core topics include:

  • Basic statistical principles of populations and sampling frames (no survey background assumed)
  • Acquiring data via samples, censuses, administrative records, transaction logging, and web scraping
  • Law, economics and statistics of data privacy and confidentiality protection
  • Data linking and integration techniques (probabilistic record linking; multivariate statistical matching)
  • Data editing and imputation techniques
  • Analytical methods for complex linked data sets, relational databases, and networks
The live version is tentatively scheduled for Spring 2016.
The development of the 2013 version of the course was sponsored by the National Science Foundation as part of the NSF-Census Research Network (NCRN), under grant #1131848 to the Cornell node, with additional support for network-wide dissemination through the NCRN coordinating grant #1237602, the office of the Kenneth F. Kahn Dean of the ILR School, and the Labor Dynamics Institute. (More information)