Event Calendar

Thu, Apr 6, 2017
Presentation at NADDI 2017 @ ILR Conference Center
Apr 6 – Apr 7 all-day

The Cornell NCRN node presents results from the past 5 years on CED²AR, their metadata editor and presentation tool.
 
Conference website: http://naddiconf.org/2017/

Mon, May 8, 2017
Cornell-Census-NSF–Sloan Workshop On Practical Privacy 2017 @ U.S. Census Bureau
May 8 all-day

The goals of the workshop are to contemplate practical implementations of privacy-preserving statistical methods by drawing together the expertise of academic and governmental researchers, and to produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.

The workshop is organized by the Labor Dynamics Institute and Cornell NCRN node. Funding for the workshop is provided by the National Science Foundation (CNS-1012593) and the Alfred P. Sloan Foundation.
This is a follow-up to the NSF–Sloan Workshop On Practical Privacy 2016.
Conference proceedings: Vilhuber, Lars, and Ian Schmutte. 2017. “Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy”, Labor Dynamics Institute Document 43, http://digitalcommons.ilr.cornell.edu/ldi/43 or http://hdl.handle.net/1813/52473.

Agenda

Start   Duration   Topic
9:00    (0h30)     Welcome (Lars Vilhuber), housekeeping plan, Associate Director’s remarks (John M. Abowd)
9:30    (1h15)     2020 Census: Implementation issues using the redistricting data
        (20min)      Presentation of current work and issues (Phil Leclerc)
        (45min)      Discussion
        (10min)      Summary
10:15   (0h10)     Coffee break
10:20   (1h15)     ACS and 2020 Census: Privacy for households or persons?
        (20min)      Presentation of current work and issues (Jerry Reiter)
        (30min)      Discussion
        (10min)      Summary
11:35   (0h55)     Lunch (please choose lunch options here)
12:30   (1h15)     Demand for Privacy
        (20min)      Presentation of current work and issues (Jenny Childs, Ian Schmutte)
        (45min)      Discussion
        (10min)      Summary
1:45    (0h15)     Coffee break
2:00    (1h10)     Economic Census 2017
        (30min)      Presentation of current work and issues (Jenny Thompson)
        (30min)      Discussion
        (10min)      Summary
3:10               Workshop ends

The detailed program can be found here.

Tue, May 9, 2017
Synthetic Longitudinal Business Data International User Seminar @ National Academy of Sciences
May 9 @ 09:00 – 14:00

In this seminar, we discuss with interested parties the conditions necessary to implement the SynLBD approach, with the goal of providing other statistical agencies a straightforward toolkit to implement the same procedure on their own data. Our hope is that by implementing similar procedures on comparable business microdata, new research both within and across countries can be enabled. The ideal end result is a series of country-specific datasets on establishments and/or firms available within the same computing environment. We discuss the data and software requirements for the lowest-cost approach, the disclosure protection statistics already implemented that can be used to achieve release of the data in this way, the validation procedures that an agency should agree to, and the likely cost of maintaining such procedures. The seminar brings together academics working on cutting-edge methods for the protection of privacy in statistical databases, and researchers and implementers at statistical agencies that have started or are interested in starting a similar project.
Five sessions will touch on the full lifecycle of a SynLBD development and implementation, and will follow the same pattern: we will first discuss existing implementations and experiences, and will then discuss, as a group, issues as they pertain to the broader community. Emphasis should be on discussing open issues and specific solutions to specific problems. Proceedings will be published later.
For more details, please see the full agenda.
Proceedings
Vilhuber, Lars; Kinney, Saki; Schmutte, Ian M., 2017. “Proceedings from the Synthetic LBD International Seminar”, Labor Dynamics Institute Document 44, available at http://digitalcommons.ilr.cornell.edu/ldi/44/ or http://hdl.handle.net/1813/52472
Documents
Overview of the SynLBD methodology
Link to presentation. Contains excerpts from

S. Kinney, “Presentation: Synthetic Data Generation for Firm Links,” NSF Census Research Network – NCRN-Cornell, 1813:50054, 2016.
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version (now available for public use) of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This paper describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.

@TechReport{kinney-2016-ecommons,
title = {Presentation: Synthetic Data Generation for Firm Links},
author = {Kinney, Saki},
institution = {NSF Census Research Network – NCRN-Cornell },
year = {2016},
number = {1813:50054},
Abstract = {In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version—now available for public use—of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This paper describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.},
keywords = {confidentiality; US Longitudinal Business Database; synthetic data},
owner = {vilhuber},
URL = {http://hdl.handle.net/1813/50054}
}

Inputs to the SynLBD process
Link to presentation. Based on Drechsler and Vilhuber (2014).
Confidentiality of the SynLBD
Link to presentation. Contains excerpts from 

S. Kinney, “Presentation: Synthetic Data Generation for Firm Links,” NSF Census Research Network – NCRN-Cornell, 1813:50054, 2016. http://hdl.handle.net/1813/50054

Validation Servers
Link to presentation. Contains excerpts from

L. Vilhuber and J. M. Abowd, “Presentation: SOLE 2016: Usage and outcomes of the Synthetic Data Server,” NSF Census Research Network – NCRN-Cornell, 1813:43883, 2016.
The Synthetic Data Server (SDS) at Cornell University was set up to provide early access to new synthetic data products by the U.S. Census Bureau. These datasets are made available to interested researchers in a controlled environment, prior to a more generalized release. Over the past 5 years, 4 synthetic datasets were made available on the server, and over 100 users have accessed the server over that time period. This paper reports on interim outcomes of the activity: results of validation requests from a user perspective, functioning of the feedback loop due to validation and user input, and the role of the SDS as an access gateway to and educational tool for other mechanisms of accessing detailed person, household, establishment, and firm statistics.

@TechReport{Vilhuber2016-cy,
title = {Presentation: {SOLE} 2016: Usage and outcomes of the Synthetic
Data Server},
author = {Vilhuber, Lars and Abowd, John M},
abstract = {The Synthetic Data Server (SDS) at Cornell University was set
up to provide early access to new synthetic data products by
the U.S. Census Bureau. These datasets are made available to
interested researchers in a controlled environment, prior to a
more generalized release. Over the past 5 years, 4 synthetic
datasets were made available on the server, and over 100 users
have accessed the server over that time period. This paper
reports on interim outcomes of the activity: results of
validation requests from a user perspective, functioning of the
feedback loop due to validation and user input, and the role of
the SDS as an access gateway to and educational tool for other
mechanisms of accessing detailed person, household,
establishment, and firm statistics.},
conference = {SOLE 2016},
institution = {NSF Census Research Network – NCRN-Cornell },
year = {2016},
number = {1813:43883},
URL = {http://hdl.handle.net/1813/43883}
}

Other recommended readings

L. Vilhuber, J. M. Abowd, and J. P. Reiter, “Synthetic establishment microdata around the world,” Statistical Journal of the International Association for Official Statistics, vol. 32, iss. 1, pp. 65-68, 2016.
In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

@article{VilhuberAbowdReiter:Synthetic:SJIAOS:2016,
title = {Synthetic establishment microdata around the world},
journal = {Statistical Journal of the International Association for Official Statistics},
author = {Lars Vilhuber and John M. Abowd and Jerome P. Reiter},
year=2016,
volume={32},
number={1},
pages={65-68},
doi={10.3233/SJI-160964},
abstract={In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.},
}

S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, “Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database,” International Statistical Review, vol. 79, iss. 3, pp. 362–384, 2011.
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

@ARTICLE{Kinney2011-ic,
title = {Towards Unrestricted Public Use Business Microdata: The
Synthetic Longitudinal Business Database},
author = {Kinney, Satkartar K and Reiter, Jerome P and Reznek, Arnold P
and Miranda, Javier and Jarmin, Ron S and Abowd, John M},
journal = {International Statistical Review},
year = {2011},
volume = {79},
pages = {362--384},
number = {3},
doi = {10.1111/j.1751-5823.2011.00153.x},
issn = {1751-5823},
keywords = {Economic census, data confidentiality, synthetic data, disclosure
limitation},
owner = {vilhuber},
publisher = {Blackwell Publishing Ltd},
timestamp = {2012.09.04},
abstract = {In most countries, national statistical agencies do not release establishment-level
business microdata, because doing so represents too large a risk
to establishments’ confidentiality. One approach with the potential
for overcoming these risks is to release synthetic data; that is,
the released establishment data are simulated from statistical models
designed to mimic the distributions of the underlying real microdata.
In this article, we describe an application of this strategy to create
a public use file for the Longitudinal Business Database, an annual
economic census of establishments in the United States comprising
more than 20 million records dating back to 1976. The U.S. Bureau
of the Census and the Internal Revenue Service recently approved
the release of these synthetic microdata for public use, making the
synthetic Longitudinal Business Database the first-ever business
microdata set publicly released in the United States. We describe
how we created the synthetic data, evaluated analytical validity,
and assessed disclosure risk.},
url = {http://dx.doi.org/10.1111/j.1751-5823.2011.00153.x}
}

J. Drechsler and L. Vilhuber, “A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database,” Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, 2014.
One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD – the German Longitudinal Business Database (GLBD) – that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.

@Article{SJIAOS-2014b,
Title = {{A First Step Towards A {German} {SynLBD}: {C}onstructing A {G}erman {L}ongitudinal {B}usiness {D}atabase}},
Author = {J{\"o}rg Drechsler and Lars Vilhuber},
Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
Year = {2014},
Volume = {30},
Abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD – the German Longitudinal Business Database (GLBD) – that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.},
DOI = {10.3233/SJI-140812},
Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
Owner = {vilhuber},
Timestamp = {2014.03.24},
URL = {http://iospress.metapress.com/content/X415V18331Q33150}
}

Funding
Funding for the workshop is provided by the National Science Foundation (CNS-1012593, SES-1131848) and the Alfred P. Sloan Foundation. The organizers thank the National Academies’ Committee on National Statistics for hosting the seminar.

Sun, May 14, 2017
Presentation of “Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics” at SIGMOD 2017 @ Hilton Chicago
May 14 – May 16 all-day

Our paper “Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics” (Samuel Haney, Ashwin Machanavajjhala, John Abowd, Matthew Graham, Mark Kutzbach and Lars Vilhuber) will be presented at SIGMOD 2017.
(link to preprint forthcoming)
The conference: The annual ACM SIGMOD/PODS conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. The conference includes a fascinating technical program with research and industrial talks, tutorials, demos, and focused workshops. It also hosts a poster session to learn about innovative technology, an industrial exhibition to meet companies and publishers, and a careers-in-industry panel with representatives from leading companies.
Tickets: http://sigmod2017.org/.

Wed, Jun 7, 2017
Vilhuber presents at Seminário DATAFIRM LatAm – Datos administrativos para la investigación sobre productividad @ Banco de la Nación Argentina
Jun 7 all-day

Vilhuber presents on “Confidentiality Protection and Physical Safeguards”. The presentation file is available at http://hdl.handle.net/1813/51487

Thu, Jun 8, 2017
Workshop: Herramientas prácticas para la apertura y uso seguro de microdatos de firmas @ Offices of CAF in Argentina
Jun 8 all-day

Vilhuber participates in a workshop of Latin American government analysts, research analysts, and data providers on the potential for secure use of firm microdata.

Wed, Jun 21, 2017
Vilhuber presents at Workshop on Transparency and Reproducibility in Federal Statistics @ The National Academies of Sciences, Engineering, and Medicine
Jun 21 all-day

Vilhuber presents on “Making confidential data part of reproducible research.” Presentation file to come.

Tue, Jul 18, 2017
Synthetic Datasets for Statistical Disclosure Control – Research and Applications Around the World @ Palais des Congrès - Mansouri Eddahbi
Jul 18 @ 10:30 – 12:30

Together with a few others from around the world, Lars Vilhuber will be presenting on results from a synthetic data validation cycle at the International Statistical Institute’s World Statistical Congress.

Thu, Oct 19, 2017
INFO7470: Michael Ratcliffe: Maintaining an Accurate Address List: Reengineering Address Canvassing through the Use of Multiple Sources and Methods @ Ives 109
Oct 19 @ 16:25 – 18:00

Michael Ratcliffe will be presenting “Maintaining an Accurate Address List: Reengineering Address Canvassing through the Use of Multiple Sources and Methods” and discussing past, present, and future definitions of geography for census data collection purposes. This presentation is part of INFO7470 (https://www.vrdc.cornell.edu/info747x/) but all are welcome.

Thu, Oct 26, 2017
INFO7470: Restricted Access Data in the FSRDC system @ Ives 109
Oct 26 @ 16:25 – 18:00

Barbara Downs (U.S. Census Bureau) will be discussing how best to access data in the FSRDC system. This presentation is part of INFO7470 (https://www.vrdc.cornell.edu/info747x/) but all are welcome.

Wed, Nov 1, 2017
Joint LDI-CISER-CPC Seminar: Jason Fields (U.S. Census Bureau) @ TBD
Nov 1 all-day

Time: TBD
Measures and Content for Studying Family Living Arrangements and Child Well-Being from SIPP 2014

CANCELLED: Jason Fields (U.S. Census Bureau) @ MVR G87
Nov 1 @ 13:15 – 14:45

Measures and Content for Studying Family Living Arrangements and Child Well-Being from SIPP 2014

Wed, Nov 8, 2017
Joint LDI-CISER-CPC Seminar: Jonathan Vespa (U.S. Census Bureau) @ TBD
Nov 8 all-day

Time: TBD
Joint LDI-CISER-CPC Seminar: Old Housing, New Needs: Are US Homes Ready for an Aging Population?

LDI-CISER-CPC Seminar: Jonathan Vespa (U.S. Census Bureau) @ MVR G87
Nov 8 @ 13:15 – 14:45

Joint LDI-CISER-CPC Seminar: Old Housing, New Needs: Are US Homes Ready for an Aging Population?

Wed, Nov 29, 2017
Joint LDI-CISER-Macroeconomics Seminar: Larry Warren @ Ives 115
Nov 29 @ 11:45 – 13:15

Part Time Employment and Firm-level Labor Demand over the Business Cycle.

Wed, Dec 6, 2017
Joint LDI-CISER-Macroeconomics Seminar: Henry Hyatt @ Ives 115
Dec 6 @ 13:00 – 14:30

Cyclical Labor Market Sorting

Wed, Dec 13, 2017
Joint LDI-CISER-Industrial Organization Workshop: Emek Basker @ Ives 115
Dec 13 @ 13:00 – 14:30

Upstream, Downstream: Diffusion and Impact of the Universal Product Code