The course is designed to teach students basic and advanced techniques for acquiring and transforming raw information into social and economic data. The 2017 version is particularly aimed at American Ph.D. students who are interested in using public-use and confidential U.S. Census Bureau data, and the confidential data of other American statistical agencies that cooperate with the Census Bureau. We cover the legal, historical, statistical, computing, and social science aspects of the data "production" process. Students will learn the historical background of the U.S. statistical system and the structure of the current system. Major emphasis is placed on U.S. Census Bureau data that are accessible from the Federal Statistical Research Data Center network, which is administered by the Census Bureau on behalf of the collaborating statistical agencies. New ways of accessing restricted-access data will also be presented. Graduate students and faculty who are planning to use RDC-based data, or are seriously considering it, should pay particular attention to the lab related to the proposal process. The RDC-accessible data products covered in the course include the internal files used to manage the Census Bureau's household and establishment frames; the Longitudinal Employer-Household Dynamics (LEHD) micro data; the Longitudinal Business Database (LBD) and its predecessor the Longitudinal Research Database (LRD); internal versions of the Survey of Income and Program Participation (SIPP), Current Population Survey (CPS), American Community Survey (ACS), American Housing Survey (AHS), and the Decennial Censuses of Population and Housing; the Employer and Non-employer Business Registers (BR and SSEL); the Censuses and Annual Surveys of Manufactures, Mining, Services, Retail Trade, Wholesale Trade, Construction, Transportation, Communications, and Utilities; Business Expenditures Survey; Characteristics of Business Owners; and others.
The latter half of the course presents some of the statistical procedures necessary to handle the complex linked data sets increasingly available as confidential data, and will apply some of those techniques in class. The course will be taught in a mixture of (a) web-based self-paced videos followed by online exercises, in-classroom discussion, and additional materials, and (b) traditional classroom lectures and labs. All sessions will be recorded.

Core topics

  • Basic statistical principles of populations and sampling frames (no survey background assumed)
  • Acquiring data via samples, censuses, administrative records, transaction logging, and web scraping
  • Law, economics and statistics of data privacy and confidentiality protection
  • Data linking and integration techniques (probabilistic record linking; multivariate statistical matching)
  • Data editing and imputation techniques
  • Analytical methods for complex linked data sets, relational databases, and networks

Learning objectives

  • To understand the history and components of the U.S. federal statistical system, and how these functions are organized in some other countries--you should be able to find the data you want and know who controls access to them
  • To recognize the source data for federal statistical products, and use these files properly even if they are only supported as restricted-access confidential data--once you have the source data you should know how to analyze them whether or not they were edited and released for public-use
  • To understand the data acquisition, edit, imputation, weighting, confidentiality protections, publications, and underlying microdata for major household and business data products in the federal statistical system--in preparing and executing your analysis, you should be able to take responsibility for the data preparation needed to create accurate, useful analysis files
  • To use spatial, temporal, and network modeling methods, especially Bayesian hierarchical models, as research tools when working with the microdata and public-use files from major household and business data products--you should be able to recognize and model the statistical and econometric complexities that occur when data are aggregated over time and space and from multiple sources
  • To produce replicable, properly curated research results based on confidential and public-use data files--you should know how to document the complete provenance of your analysis and the curation of essential elements for reproduction of your results from the original data files


Lars Vilhuber, Cornell University

Lars Vilhuber, Ph.D. in Economics, has worked in both academic research and government. His interest in statistical disclosure limitation issues is a consequence of his other research interest: working with highly detailed longitudinally linked data to analyze the effects and causes of mass layoffs, worker mobility, and the dynamics of the local labor market. He is presently on the faculty of the Department of Economics at Cornell University, a Senior Research Associate at the ILR School at Cornell University, Ithaca, Executive Director of the Labor Dynamics Institute, and affiliated with the U.S. Census Bureau (Center for Economic Studies, CES), as well as on the scientific advisory boards of two research data center networks in Canada and France. Over the years, he has also gained extensive expertise on the data needs of economists and other social scientists, having been involved in the creation and maintenance of several data systems designed with analysis, publication, replicability, and maintenance of large-scale code bases in mind. Other interests include dissemination of metadata and disclosure avoidance techniques. He is currently Principal Investigator of the Cornell node of the NSF-Census Research Network (NCRN), as well as the Principal Investigator on the network's Coordinating Office.

Warren Brown, Cornell University

Warren A. Brown is Senior Research Associate at Cornell University where he directs the Program on Applied Demographics and is the Research Director of the Cornell site of the New York Federal Statistical Research Data Center, a consortium of research institutions in the New York metropolitan area and upstate New York. He is also the 2015-2016 President of the Association of Public Data Users (APDU) and serves on the National Academy of Sciences’ Standing Committee on Reengineering Census Operations. His teaching, research and outreach efforts involve him with the application of demographic information to areas such as strategic planning for workforce and economic development, consumer behavior and market analysis, households and housing market analysis, regional transportation planning, hospitality and recreation industries, health services for the elderly, and environmental protection. He is an expert on the American Community Survey.

Sylverie Herbert

Ph.D. student in Economics, with interests in Macroeconomics and Finance.

Guest Lecturers

We draw on expert guest lecturers for a variety of topics. A complete updated list is available here.


Session 0: Course Introduction
Aug 24 @ 4:25 pm – 6:00 pm

We will introduce the teaching environment (technically and organizationally), and present the class itself.

Lecture notes

  • INFO7470 2017 Course Introduction (PPTX, PDF)

Session 1: Overview of the U.S. Statistical System
Aug 31 @ 4:25 pm – 6:00 pm

An overview of the U.S. statistical system is given.


Lecture notes

Session 2: History of the Federal Statistical Infrastructure
Sep 7 @ 4:25 pm – 6:00 pm

Margo Anderson (University of Wisconsin – Milwaukee) presents on the history of the federal statistical system (flipped classroom). She will be present to discuss the lecture.

Readings and other information

Lecture Notes

Historical Perspectives on the U.S. Federal Statistical System

About the Guest Lecturer

Margo Anderson, University of Wisconsin – Milwaukee

Margo Anderson is Distinguished Professor of History & Urban Studies at the University of Wisconsin – Milwaukee. She specializes in American social, urban and women's history and has research interests in both urban history and the history of the social sciences and the development of statistical data systems, particularly the census. Her publications include Who Counts? The Politics of Census Taking in Contemporary America (2001), coauthored with Stephen E. Fienberg, and a coedited volume with Victor Greene, Perspectives on Milwaukee's Past (University of Illinois Press, 2009). Her most recent publication, of particular relevance to this class, is The American Census: A Social History, Second Edition. Yale University Press, 2015. More information about Margo can be found at her University of Wisconsin-Milwaukee website and her personal website.

Session 3: [No class] Universes, Populations, Frames, and Sampling
Sep 14 @ 4:25 pm – 4:30 pm

This class coincides with the FSRDC system’s annual conference. There will be no in-classroom activity at most sites on this day (please check with your local coordinator). The content of this session will be discussed on Sept 21, 2017, so students should take the time to view the materials on edX during this week.

Lecture notes

Session 4: Measuring People and Households
Sep 21 @ 4:25 pm – 6:00 pm
Session 5: Measuring Business and Economic Activity
Sep 28 @ 4:25 pm – 6:00 pm

This lecture is a “flipped” lecture. 

Lecture Notes


The lab will be posted on edX.

Session 7: Data from Other Statistical Agencies and Other Sources
Oct 12 @ 4:25 pm – 6:00 pm

Health statistics, energy statistics, agricultural statistics, others. Register-based statistics, organic data. Details to come.

Lecture Notes

Session 8: Census Geography
Oct 19 @ 4:25 pm – 6:00 pm

This will be a “flipped classroom” session on Geographic Information Systems (GIS) – basic geocoding, geographic concepts, and other topics.

Lecture Notes

Session 9: Restricted Access Data and Replicability
Oct 26 @ 4:25 pm – 6:00 pm

Flipped classroom about access to restricted-access data. Students will be introduced to the research proposal mechanism of the Federal Statistical Research Data Center.

Discussion will focus on how to access various restricted access data sets. Guest presenters may be present live in the videoconference classroom.

Part 3 switches gears, and discusses the need for and the requirements of replicable science (in general, and in restricted-access environments). This part is a live lecture by Lars Vilhuber.

Lecture Notes

Additional links

Session 10: Statistical Tools – Record Linkage and Total Quality Evaluation
Nov 2 @ 4:25 pm – 6:00 pm

The class is a flipped classroom, with discussion by “guest lecturer” John Abowd.

Introduction to record linking

  • What is record linking, what is it not, what is the theory?
  • Record linking: applications and examples – How do you do it, what do you need, what are the possible complications?
  • Examples of record linking
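
As a rough illustration of the theory behind probabilistic record linking, the sketch below scores a pair of records with Fellegi-Sunter-style agreement weights. It is not taken from the course materials: the field names, the m- and u-probabilities, and the example records are all made-up assumptions, and a real application would estimate the probabilities from data and handle blocking, string comparators, and missing values.

```python
import math

# Hypothetical m- and u-probabilities per field (illustrative values only):
#   m = P(field agrees | the two records refer to the same entity)
#   u = P(field agrees | the two records refer to different entities)
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip_code": (0.85, 0.10),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum the log2 likelihood ratios over all compared fields.

    Agreement on a field adds log2(m/u); disagreement adds
    log2((1-m)/(1-u)), which is negative whenever m > u.
    """
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# Made-up example records.
survey = {"last_name": "Smith", "birth_year": 1970, "zip_code": "14850"}
admin_same = {"last_name": "Smith", "birth_year": 1970, "zip_code": "14853"}
admin_other = {"last_name": "Jones", "birth_year": 1982, "zip_code": "10001"}

print("likely match weight:   ", round(match_weight(survey, admin_same), 2))
print("likely nonmatch weight:", round(match_weight(survey, admin_other), 2))
```

Pairs with a high total weight would be declared links, pairs with a low weight non-links, and those in between sent to clerical review; choosing the two thresholds is part of the method and is not shown here.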

Total quality evaluation – errors from coverage, sampling, edit, and imputation.

Session 11: Statistical Tools – Edit and Imputation
Nov 9 @ 4:25 pm – 6:00 pm
  • Formal models of edits and imputations
  • Missing data overview
  • Missing records – Frame or census – Survey
  • Missing items
  • Overview of different products
  • Overview of methods
  • Formal multiple imputation methods
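
To make the last item concrete, here is a minimal sketch of the standard combining rules for multiple imputation (Rubin's rules), which pool a point estimate and its variance across several completed data sets. It is an illustration under stated assumptions, not the course's lab solution: the function name `combine_rubin` and the input numbers are hypothetical, and the estimates are assumed to be scalars computed separately on each imputed data set.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Pool m completed-data estimates with Rubin's combining rules.

    estimates: point estimate from each of the m imputed data sets
    variances: estimated sampling variance of each point estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Made-up numbers for three imputed data sets.
pooled, total_var = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print("pooled estimate:", pooled)
print("total variance: ", round(total_var, 4))
```

The between-imputation term is what separates this from simply averaging the variances: it carries the extra uncertainty due to the missing data into the final standard error.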

Lecture Notes


The lab (an edit and imputation exercise) will be posted on the INFO7470x edX site. You will need to create a program, and upload the program (language of your choice) to edX.

Session 12: Statistical Tools – Disclosure Limitation Methods – Synthetic Data
Nov 16 @ 4:25 pm – 6:00 pm
  • Why must users of restricted-access data learn about confidentiality protection?
  • What is statistical disclosure limitation?
  • What are privacy-preserving data mining and differential privacy?
  • Basic methods for disclosure avoidance (SDL)
  • Rules and methods for model-based SDL
  • SDL-based noise methods
  • Synthetic data
  • Differential privacy methods
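
As one small example of the differential privacy methods listed above, the sketch below applies the Laplace mechanism to a single count query. It is an assumption-laden illustration rather than any agency's production method: the function names, the true count of 100, and the choice of epsilon are hypothetical, and the sensitivity of 1 assumes that adding or removing one person changes the count by at most one.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng=None):
    """Release a count with Laplace noise calibrated to epsilon.

    Assumes sensitivity 1 (one person changes the count by at most 1).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical query: 100 establishments in a cell, epsilon = 0.5.
rng = random.Random(2017)  # seeded only so the illustration is repeatable
print("released count:", round(noisy_count(100, 0.5, rng), 2))
```

Smaller values of epsilon give stronger protection but noisier releases, which is the privacy-accuracy trade-off at the center of this session.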

Lecture Notes

Supplementary Materials

No class (Cornell Thanksgiving Recess)
Nov 23 all-day
Session 13: Statistical Tools – Geographic and Network Analysis Methods
Nov 30 @ 4:25 pm – 6:00 pm

Flipped class

  • Part A: Spatial Analysis (Nicholas Nagle of University of Tennessee – Knoxville)
  • Part B: Network Analysis (John Abowd, Cornell University)

Part A: Spatial Analysis


  • Basic Geocoding
  • Tools for Geocoding
  • Analysis Methods
  • Tools for Geographic Analysis

Lecture Notes

About the Guest Lecturer

Nicholas Nagle, University of Tennessee – Knoxville

Nicholas Nagle is a GIScientist/geospatial analyst whose research centers on combining spatial data in order to produce more reliable geographic information. Prof. Nagle holds a joint faculty appointment with the Geographic Information Science and Technology group at Oak Ridge National Laboratory. He is currently working on a number of projects improving the availability and reliability of data from the US Census Bureau, developing methods to identify land cover change, and is working on a number of projects related to population and health, both in Tennessee and in developing countries.

Part B: Network Analysis

This part of the lecture is a live class.

Lecture Notes

About the Guest Lecturer

John Abowd, Cornell University and now U.S. Census Bureau

John M. Abowd is currently the Associate Director for Research and Methodology and Chief Scientist, United States Census Bureau, on leave from Cornell University. At Cornell, he is the Edmund Ezra Day Professor of Economics, Professor of Statistics and Information Science, and the Director of the Labor Dynamics Institute (LDI). He previously served as a Distinguished Senior Research Fellow at the United States Census Bureau (1998-2015). He is also a Research Associate at the National Bureau of Economic Research (NBER, Cambridge, MA), Research Affiliate at the Centre de Recherche en Economie et Statistique (CREST, Paris, France), Research Fellow at the Institute for Labor Economics (IZA, Bonn, Germany), and Research Fellow at IAB (Institut für Arbeitsmarkt- und Berufsforschung, Nürnberg, Germany). He is the outgoing President (2014-2015) and Fellow of the Society of Labor Economists, a past Chair (2013) of the Business and Economic Statistics Section and a Fellow of the American Statistical Association. He is an Elected Member of the International Statistical Institute and a Fellow of the Econometric Society. He previously served on the National Academies’ Committee on National Statistics (2010-2016) and on the American Economic Association’s Committee on Economic Statistics. He served as Director of the Cornell Institute for Social and Economic Research (CISER) from 1999 to 2007.

No final exam
Dec 7 all-day

The class does not have a final exam. The last class at Cornell is on November 30. Check with your local coordinator about any local arrangements.



Any student enrolled in a Ph.D. or Masters program at one of the participating universities may take this course. Students at Cornell register for INFO 7470 (or ECON7400/ILRLE7400, which are identical). Some programming experience (in any statistical programming language) is required for some of the labs. Some statistical or econometric training is required for some of the lectures.


In addition to following the local registration rules at each participating site, all students will also register in an edX Edge class. The URL will be updated at a later time.


The course has two types of lectures. The first half of the course (roughly Sessions 3-8) is in a "flipped classroom" setting, whereas the latter half is a more traditional lecture style. The type of lecture will be identified for each date on the course calendar.

Flipped classroom

  • In-classroom discussions, additional materials, and guest lectures will occur at the time and date listed on the course calendar.
  • Lectures should be viewed on edX Edge prior to the classroom time.
  • Exercises and labs on edX Edge should be completed prior to the classroom time.
  • In-classroom time is expected to be shorter than the usual full length, probably about 1 hour, depending on classroom participation.

Traditional lecture

  • In-classroom lecture and discussion will occur at the time and date listed on the course calendar.
  • The expected length corresponds to the listed time.
  • Additional materials may be made available on edX Edge.
  • Exercises and labs will be uploaded and graded on edX Edge.


There is one lab per week. Timing depends on the type of lecture (see above).


Students taking the course for credit are expected to attend every lecture and complete all labs. Any student who wishes to take the course for a grade must complete all labs. The final grade is based on lab grades. The course will be letter graded and you will receive the grade from the instructor of record at your university.


Auditing students are expected to attend every lecture and complete all labs. Casual auditors will not be allowed.

Technical requirements

For technical requirements for remote sites, please consult this document.