- Basic statistical principles of populations and sampling frames (no survey background assumed)
- Acquiring data via samples, censuses, administrative records, transaction logging, and web scraping
- Law, economics and statistics of data privacy and confidentiality protection
- Data linking and integration techniques (probabilistic record linking; multivariate statistical matching)
- Data editing and imputation techniques
- Analytical methods for complex linked data sets, relational databases, and networks
- To understand the history and components of the U.S. federal statistical system, and how these functions are organized in some other countries--you should be able to find the data you want and know who controls access to them
- To recognize the source data for federal statistical products, and use these files properly even if they are only supported as restricted-access confidential data--once you have the source data you should know how to analyze them whether or not they were edited and released for public-use
- To understand the data acquisition, edit, imputation, weighting, confidentiality protections, publications, and underlying microdata for major household and business data products in the federal statistical system--in preparing and executing your analysis, you should be able to take responsibility for the data preparation needed to create accurate, useful analysis files
- To use spatial, temporal, and network modeling methods, especially Bayesian hierarchical models, as research tools when working with the microdata and public-use files from major household and business data products--you should be able to recognize and model the statistical and econometric complexities that occur when data are aggregated over time and space and from multiple sources
- To produce replicable, properly curated research results based on confidential and public-use data files--you should know how to document the complete provenance of your analysis and the curation of essential elements for reproduction of your results from the original data files
Lars Vilhuber, Cornell University
Warren Brown, Cornell University
We draw on expert guest lecturers for a variety of topics. A complete updated list is available here.
Margo Anderson (University of Wisconsin – Milwaukee) presents on the history of the federal statistical system (flipped classroom). She will be present to discuss the lecture.
Readings and other information
- Anderson, Margo. The American Census: A Social History, Second Edition. Yale University Press, 2015.
- Anderson, Margo J., and Seltzer, William. “Federal Statistical Confidentiality and Business Data: Twentieth Century Challenges and Continuing Issues.” Journal of Privacy and Confidentiality 1.1 (2009): 7-52, 55-58.
About the Guest Lecturer
Margo Anderson, University of Wisconsin – Milwaukee
This class coincides with the FSRDC system’s annual conference. There will be no in-classroom activity at most sites on this day (please check with your local coordinator). The content of this section will be discussed on Sept 21, 2017, so students should take the time to view the materials on edX during this week.
Health statistics, energy statistics, agricultural statistics, others. Register-based statistics, organic data.
Erica Groshen, Cornell University, will take part in the discussion.
Brent Hueth, University of Wisconsin-Madison, will be discussing topics related to agricultural statistics.
- Health statistics (Lecture Notes: INFO7470-S7-Parker, Jennifer Parker (NCHS))
- Agricultural statistics (Lecture Notes: INFO7470-S7-DunnHueth, additional materials, INFO7470-S7-Migrant Farm Labor in the Census of Agriculture, Richard Dunn (University of Connecticut) and Brent Hueth (University of Wisconsin-Madison))
- EIA presentation: INFO7470-S9-EIA-Background-2016 (Jacob Bournazian (EIA))
- Register-based statistics: INFO7470-S9-Register-data
- Alternate data sources: INFO7470-S9-Organic-data
- Updates by Erica Groshen on working with BLS data: INFO7470 2017 Groshen BLS
This will be a “flipped classroom” session on Geographic Information Systems (GIS) – basic geocoding, geographic concepts, and other topics.
Michael Ratcliffe, U.S. Census Bureau
- Geography: INFO7470-S8-Census Geography Concepts
Flipped classroom about access to restricted-access data. Students will be introduced to the research proposal mechanism of the Federal Statistical Research Data Center, including data from the Census Bureau, NCHS, and BLS.
Discussion will focus on how to access various restricted-access data sets. Guest presenters may be present live in the videoconference classroom.
The presentation on replicable science has been moved to a later date.
The class is both flipped classroom and live presentation.
We discuss the need for and the requirements of replicable science (in general, and in restricted-access environments). This part is a live lecture by Lars Vilhuber.
Introduction to record linking
- What is record linking, what is it not, what is the theory?
- Record linking: applications and examples – How do you do it, what do you need, what are the possible complications?
- Examples of record linking
- INFO7470-S10-Primer_for_Programs (PDF) or (Powerpoint)
- Large-scale Data Linkage from Multiple Sources: Methodology and Research Challenges
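As a rough illustration of the "how do you do it" question above, the following is a minimal sketch of one common way to score a candidate record pair in probabilistic linking, in the style of the Fellegi-Sunter model. It is not taken from the course materials: the field names, the m- and u-probabilities, and the example records are all hypothetical.

```python
import math

# Hypothetical per-field probabilities (assumptions for illustration only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip": (0.85, 0.10),
}

def match_weight(rec_a, rec_b, params=PARAMS):
    """Sum log2 likelihood-ratio weights over the comparison fields."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

# Hypothetical records from two files to be linked.
a = {"last_name": "smith", "birth_year": 1970, "zip": "14850"}
b = {"last_name": "smith", "birth_year": 1970, "zip": "14853"}

print(match_weight(a, b))
```

In practice a pair would then be declared a link, a non-link, or a candidate for clerical review by comparing its weight with two thresholds, and the m- and u-probabilities would be estimated from the data rather than fixed by hand.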
John M. Abowd, U.S. Census Bureau and Cornell University, will lead the discussion.
- Formal models of edits and imputations
- Missing data overview
- Missing records (frame or census; survey)
- Missing items
- Overview of different products
- Overview of methods
- Formal multiple imputation methods
- INFO7470 S11 -Statistical Tools Edit and Imputation (Powerpoint)
- INFO7470 S11 -Statistical Tools Edit and Imputation Examples
The lab (an edit and imputation exercise) is posted on the INFO7470x edX site. You will need to create a program and upload it (in the language of your choice) to edX. A toy example is illustrated in a video on the edX site; you can download the spreadsheet toy-example-imputation.xlsx here.
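To give a sense of the general shape such a program might take, here is a minimal sketch in Python. It does not reproduce the lab or the spreadsheet toy example: the records, field names, edit rule (payroll must be non-negative), and the choice of within-group mean imputation are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical records; None marks a missing item.
records = [
    {"id": 1, "industry": "A", "payroll": 100.0},
    {"id": 2, "industry": "A", "payroll": None},
    {"id": 3, "industry": "A", "payroll": 140.0},
    {"id": 4, "industry": "B", "payroll": -5.0},  # fails the edit rule
    {"id": 5, "industry": "B", "payroll": 60.0},
]

def edit(recs):
    """Edit step: blank out values that violate the rule payroll >= 0."""
    out = []
    for r in recs:
        r = dict(r)  # copy so the input is left untouched
        if r["payroll"] is not None and r["payroll"] < 0:
            r["payroll"] = None
        out.append(r)
    return out

def impute_group_mean(recs):
    """Imputation step: fill missing payroll with the mean of observed
    values in the same industry, and flag the imputed records.
    Assumes every industry has at least one observed value."""
    observed = {}
    for r in recs:
        if r["payroll"] is not None:
            observed.setdefault(r["industry"], []).append(r["payroll"])
    out = []
    for r in recs:
        r = dict(r)
        r["imputed"] = r["payroll"] is None
        if r["imputed"]:
            r["payroll"] = mean(observed[r["industry"]])
        out.append(r)
    return out

completed = impute_group_mean(edit(records))
for r in completed:
    print(r)
```

Keeping an explicit `imputed` flag preserves the distinction between reported and filled-in values. A single mean imputation like this one understates variability, which is one motivation for the formal multiple imputation methods listed among the session topics.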
- Why must users of restricted-access data learn about confidentiality protection?
- What is statistical disclosure limitation?
- What are privacy-preserving data mining and differential privacy?
- Basic methods for disclosure avoidance (SDL)
- Rules and methods for model-based SDL
- SDL-based noise methods
- Synthetic data
- Differential privacy methods
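As one small, concrete instance of the differential privacy methods listed above, the following sketch applies the Laplace mechanism to a counting query. It is an illustration under stated assumptions, not the course's own material: the function names, the example count, and the choice of epsilon are hypothetical, and the sensitivity of 1 assumes that adding or removing one person changes the count by at most 1.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential draws with rate 1/scale."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(2017)  # seeded only to make the illustration repeatable
print(private_count(100, epsilon=0.5, rng=rng))
```

Smaller values of epsilon give stronger protection and noisier releases; a production implementation would also need a cryptographically secure noise source and careful handling of floating-point arithmetic, which this sketch ignores.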
- Part A: Spatial Analysis (Nicholas Nagle of University of Tennessee – Knoxville)
- Part B: Network Analysis (John Abowd, Cornell University)
Part A: Spatial Analysis
- Basic Geocoding
- Tools for Geocoding
- Analysis Methods
- Tools for Geographic Analysis
Requirements
Any student enrolled in a Ph.D. or Master's program at one of the participating universities may take this course. Students at Cornell register for INFO 7470 (or ECON7400/ILRLE7400, identical). Some programming experience (in any statistical programming language) is required for some of the labs. Some statistical or econometric training is required for some of the lectures.
Enrollment
In addition to following the local registration rules at each participating site, all students will also register in an edX Edge class. The URL will be updated at a later time.
Lectures
The course has two types of lectures. The first half of the course (roughly Sessions 3-8) is in a "flipped classroom" setting, whereas the latter half is a more traditional lecture style. The type of lecture will be identified for each date on the course calendar.
Flipped classroom:
- In-classroom discussions and additional materials and guest lectures will occur at the time and date listed on the course calendar.
- Lectures should be viewed on edX Edge prior to the classroom time
- Exercises and labs on edX Edge should be completed prior to the classroom time
- In-classroom time is expected to be shorter than the usual full length, roughly 1 hour, depending on classroom participation
Traditional lecture:
- In-classroom lecture and discussion will occur at the time and date listed on the course calendar.
- The expected length corresponds to the listed time.
- Additional materials may be made available on edX Edge
- Exercises and labs will be uploaded and graded on edX Edge.