The course is designed to teach students basic and advanced techniques for acquiring and transforming raw information into social and economic data. The 2017 version is particularly aimed at American Ph.D. students who are interested in using public-use and confidential U.S. Census Bureau data, and the confidential data of other American statistical agencies that cooperate with the Census Bureau. We cover the legal, historical, statistical, computing, and social science aspects of the data "production" process. Students will learn the historical background of the U.S. statistical system and the structure of the current system. Major emphasis is placed on U.S. Census Bureau data that are accessible from the Federal Statistical Research Data Center network, which is administered by the Census Bureau on behalf of the collaborating statistical agencies. New ways of accessing restricted-access data will be presented. Graduate students and faculty who are planning to use RDC-based data, or are seriously considering it, should pay particular attention to the lab related to the proposal process. The RDC-accessible data products covered in the course include the internal files used to manage the Census Bureau's household and establishment frames; the Longitudinal Employer-Household Dynamics (LEHD) microdata; the Longitudinal Business Database (LBD) and its predecessor the Longitudinal Research Database (LRD); internal versions of the Survey of Income and Program Participation (SIPP), Current Population Survey (CPS), American Community Survey (ACS), American Housing Survey (AHS), and the Decennial Censuses of Population and Housing; the Employer and Non-employer Business Registers (BR and SSEL); the Censuses and Annual Surveys of Manufactures, Mining, Services, Retail Trade, Wholesale Trade, Construction, Transportation, Communications, and Utilities; Business Expenditures Survey; Characteristics of Business Owners; and others.
The latter half of the course presents some of the statistical procedures necessary to handle the complex linked data sets increasingly available as confidential data, and applies some of those techniques in class. The course will be taught in a mixture of (a) web-based self-paced videos followed by online exercises, in-classroom discussion, and additional materials, and (b) traditional classroom lectures and labs. All sessions will be recorded.

Core topics

  • Basic statistical principles of populations and sampling frames (no survey background assumed)
  • Acquiring data via samples, censuses, administrative records, transaction logging, and web scraping
  • Law, economics and statistics of data privacy and confidentiality protection
  • Data linking and integration techniques (probabilistic record linking; multivariate statistical matching)
  • Data editing and imputation techniques
  • Analytical methods for complex linked data sets, relational databases, and networks

Learning objectives

  • To understand the history and components of the U.S. federal statistical system, and how these functions are organized in some other countries--you should be able to find the data you want and know who controls access to them
  • To recognize the source data for federal statistical products, and use these files properly even if they are only supported as restricted-access confidential data--once you have the source data you should know how to analyze them whether or not they were edited and released for public-use
  • To understand the data acquisition, edit, imputation, weighting, confidentiality protections, publications, and underlying microdata for major household and business data products in the federal statistical system--in preparing and executing your analysis, you should be able to take responsibility for the data preparation needed to create accurate, useful analysis files
  • To use spatial, temporal, and network modeling methods, especially Bayesian hierarchical models, as research tools when working with the microdata and public-use files from major household and business data products--you should be able to recognize and model the statistical and econometric complexities that occur when data are aggregated over time and space and from multiple sources
  • To produce replicable, properly curated research results based on confidential and public-use data files--you should know how to document the complete provenance of your analysis and the curation of essential elements for reproduction of your results from the original data files


Lars Vilhuber, Cornell University

Lars Vilhuber, Ph.D. in Economics, has worked in both academic research and government. His interest in statistical disclosure limitation issues is a consequence of his other research interest: working with highly detailed longitudinally linked data to analyze the effects and causes of mass layoffs, worker mobility, and the dynamics of the local labor market. He is presently on the faculty of the Department of Economics at Cornell University, a Senior Research Associate at the ILR School at Cornell University, Ithaca, Executive Director of the Labor Dynamics Institute, and affiliated with the U.S. Census Bureau (Center for Economic Studies, CES), as well as on the scientific advisory boards of two research data center networks in Canada and France. Over the years, he has also gained extensive expertise on the data needs of economists and other social scientists, having been involved in the creation and maintenance of several data systems designed with analysis, publication, replicability, and maintenance of large-scale code bases in mind. Other interests include dissemination of metadata and disclosure avoidance techniques. He is currently Principal Investigator of the Cornell node of the NSF-Census Research Network (NCRN), as well as the Principal Investigator on the network's Coordinating Office.

Warren Brown, Cornell University

Warren A. Brown is a Senior Research Associate at Cornell University, where he directs the Program on Applied Demographics and is the Research Director of the Cornell site of the New York Federal Statistical Research Data Center, a consortium of research institutions in the New York metropolitan area and upstate New York. He is also the 2015-2016 President of the Association of Public Data Users (APDU) and serves on the National Academy of Sciences’ Standing Committee on Reengineering Census Operations. His teaching, research, and outreach efforts involve the application of demographic information to areas such as strategic planning for workforce and economic development, consumer behavior and market analysis, households and housing market analysis, regional transportation planning, hospitality and recreation industries, health services for the elderly, and environmental protection. He is an expert on the American Community Survey.


Classes start on August 24, 2017.


The development of this course was sponsored by the National Science Foundation as part of the NSF-Census Research Network (NCRN), under grant #1131848 to the Cornell node, with additional support for network-wide dissemination through the NCRN coordinating grant #1237602, the office of the Kenneth F. Kahn Dean of the ILR School, and the Labor Dynamics Institute. Previous versions of this

