Event Calendar

Sep
17
Wed
2014
Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results @ Privacy in Statistical Databases (PSD) 2014
Sep 17 @ 11:00 – 11:30

“Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results,” Javier Miranda (U.S. Census Bureau) and Lars Vilhuber (NCRN, Cornell University)
 
Abstract: “The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.”
 
Proceedings are available at Springer. A working paper is available in our eCommons repository.

Aug
11
Tue
2015
JSM 2015: Synthetic Longitudinal Business Databases for International Comparisons @ Joint Statistical Meetings (JSM) 2015
Aug 11 @ 14:00 – 15:50

“Synthetic Longitudinal Business Databases for International Comparisons,” Joerg Drechsler (Institute for Employment Research) and Lars Vilhuber (Cornell University)
International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited on an international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited version of the U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.
http://www.amstat.org/meetings/JSM/2015/onlineprogram/AbstractDetails.cfm?abstractid=315820

Aug
13
Thu
2015
JSM 2015: Assessing the Data Quality of Public Use Tabulations Produced from Synthetic Data: Synthetic Business Dynamics Statistics @ Joint Statistical Meetings (JSM) 2015
Aug 13 @ 08:30 – 10:20

“Assessing the Data Quality of Public Use Tabulations Produced from Synthetic Data: Synthetic Business Dynamics Statistics,” Lars Vilhuber, Cornell University; Javier Miranda, U.S. Census Bureau
Discussant: John Abowd, Cornell University
We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau’s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
http://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=316288
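
The abstract above does not spell out the algorithms, so the following is only a minimal sketch of the general idea of using synthetic data solely in potentially sensitive cells. The function name `blended_tabulation`, the `min_count` threshold, and the rule "a cell is sensitive when fewer than `min_count` observed establishments contribute to it" are all assumptions made for illustration, not the authors' method or the Census Bureau's disclosure rules.

```python
from collections import defaultdict


def blended_tabulation(observed, synthetic, min_count=3):
    """Tabulate employment by cell, switching to synthetic data in sensitive cells.

    observed, synthetic: lists of (cell, employment) tuples.
    min_count: hypothetical sensitivity rule; a cell counts as sensitive
    when fewer than min_count observed establishments contribute to it.
    """
    obs_totals = defaultdict(int)
    obs_counts = defaultdict(int)
    for cell, employment in observed:
        obs_totals[cell] += employment
        obs_counts[cell] += 1

    syn_totals = defaultdict(int)
    for cell, employment in synthetic:
        syn_totals[cell] += employment

    table = {}
    for cell, count in obs_counts.items():
        if count < min_count:
            # Sensitive cell: publish the synthetic total instead of suppressing.
            table[cell] = (syn_totals.get(cell, 0), "synthetic")
        else:
            # Non-sensitive cell: publish the observed total unchanged.
            table[cell] = (obs_totals[cell], "observed")
    return table


# Made-up records: cell "A" has three establishments, cell "B" only one.
observed = [("A", 10), ("A", 20), ("A", 30), ("B", 500)]
synthetic = [("A", 12), ("A", 18), ("A", 33), ("B", 470)]
table = blended_tabulation(observed, synthetic)
print(table)
```

In this toy example, cell "B" would be suppressed under a minimum-count rule; the sketch publishes a synthetic value for it and flags the source, so the table has no suppressed cells.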

Oct
6
Tue
2015
Vilhuber @ UNECE 2015: Using partially synthetic micro data to protect sensitive cells in business statistics @ UNECE Statistical Data Confidentiality Work Session
Oct 6 all-day

“Using partially synthetic microdata to protect sensitive cells in business statistics,” Lars Vilhuber (NCRN, Cornell University), Javier Miranda (U.S. Census Bureau). This is an updated version of the presentation made at JSM 2015.
 

Oct
23
Fri
2015
Vilhuber @ CAED 2015: “Usage and outcomes of the Synthetic Data Server” @ Comparative Analysis of Enterprise Data (CAED) 2015 Conference
Oct 23 @ 08:30 – Oct 25 @ 14:15

“Usage and outcomes of the Synthetic Data Server,” Lars Vilhuber (NCRN, Cornell University) and John Abowd (NCRN, Cornell University) 

The Synthetic Data Server (SDS) at Cornell University was set up to provide early access to new synthetic data products by the U.S. Census Bureau. These datasets are made available to interested researchers in a controlled environment, prior to a more generalized release. Over the past 5 years, 4 synthetic datasets were made available on the server, and over 100 users have accessed the server over that time period. This paper reports on interim outcomes of the activity: results of validation requests from a user perspective, functioning of the feedback loop due to validation and user input, and the role of the SDS as an access gateway to and educational tool for other mechanisms of accessing detailed person, household, establishment, and firm statistics.

Tickets: http://caed2015.sabanciuniv.edu/registration-form.

Nov
30
Wed
2016
Lars Vilhuber: “Disclosure Limitation and Confidentiality Protection in Linked Data” @ Centre interuniversitaire de recherche en analyse des organisations
Nov 30 @ 08:30 – 14:00

Lars Vilhuber speaks about “Disclosure Limitation and Confidentiality Protection in Linked Data” at the Center for Interuniversity Research and Analysis of Organizations’s conference on “Facilitate the access to Quebec data: How and to what ends?” The conference is jointly organized with the Quebec inter-University Centre for Social Statistics (QICSS). The presentation relies on joint work with John M. Abowd and Ian M. Schmutte.
[Presentation]

May
9
Tue
2017
Synthetic Longitudinal Business Data International User Seminar @ National Academy of Sciences
May 9 @ 09:00 – 14:00

In this seminar, we discuss with interested parties the conditions necessary to implement the SynLBD approach, with the goal of providing other statistical agencies a straightforward toolkit to implement the same procedure on their own data. Our hope is that by implementing similar procedures on comparable business microdata, new research both within and across countries can be enabled. The ideal end result is a series of country-specific datasets on establishments and/or firms available within the same computing environment. We discuss the data and software requirements for the lowest-cost approach, the disclosure protection statistics already implemented that can be used to achieve release of the data in this way, the validation procedures that an agency should agree to, and the likely cost of maintaining such procedures. The seminar brings together academics working on cutting-edge methods for the protection of privacy in statistical databases, and researchers and implementers at statistical agencies that have started or are interested in starting a similar project.
Five sessions will touch on the full lifecycle of a SynLBD development and implementation, and will follow the same pattern. We will first discuss existing implementations and experiences, and will then as a group discuss issues as they pertain to the broader community. The emphasis should be on discussing open issues and specific solutions to specific problems. Proceedings will be published later.
For more details, please see the full agenda.
Proceedings
Vilhuber, Lars; Kinney, Saki; Schmutte, Ian M., 2017. “Proceedings from the Synthetic LBD International Seminar”, Labor Dynamics Institute Document 44, available at http://digitalcommons.ilr.cornell.edu/ldi/44/ or http://hdl.handle.net/1813/52472
Documents
Overview of the SynLBD methodology
Link to presentation. Contains excerpts from

S. Kinney, “Presentation: Synthetic Data Generation for Firm Links,” NSF Census Research Network – NCRN-Cornell, 1813:50054, 2016. [Abstract] [URL] [Bibtex]
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This paper describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.

@TechReport{kinney-2016-ecommons,
title = {Presentation: Synthetic Data Generation for Firm Links},
author = {Kinney, Saki},
institution = {NSF Census Research Network – NCRN-Cornell },
year = {2016},
number = {1813:50054},
Abstract = {In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version—now available for public use—of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This paper describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.},
keywords = {confidentiality; US Longitudinal Business Database; synthetic data},
owner = {vilhuber},
URL = {http://hdl.handle.net/1813/50054}
}

Inputs to the SynLBD process
Link to presentation. Based on Drechsler and Vilhuber (2014).
Confidentiality of the SynLBD
Link to presentation. Contains excerpts from 

S. Kinney, “Presentation: Synthetic Data Generation for Firm Links,” NSF Census Research Network – NCRN-Cornell, 1813:50054, 2016. http://hdl.handle.net/1813/50054 (abstract and BibTeX entry are listed under “Overview of the SynLBD methodology”)

Validation Servers
Link to presentation. Contains excerpts from

L. Vilhuber and J. M. Abowd, “Presentation: SOLE 2016: Usage and outcomes of the Synthetic Data Server,” NSF Census Research Network – NCRN-Cornell, 1813:43883, 2016. [Abstract] [URL] [Bibtex]
The Synthetic Data Server (SDS) at Cornell University was set up to provide early access to new synthetic data products by the U.S. Census Bureau. These datasets are made available to interested researchers in a controlled environment, prior to a more generalized release. Over the past 5 years, 4 synthetic datasets were made available on the server, and over 100 users have accessed the server over that time period. This paper reports on interim outcomes of the activity: results of validation requests from a user perspective, functioning of the feedback loop due to validation and user input, and the role of the SDS as an access gateway to and educational tool for other mechanisms of accessing detailed person, household, establishment, and firm statistics.

@TechReport{Vilhuber2016-cy,
title = {Presentation: {SOLE} 2016: Usage and outcomes of the Synthetic
Data Server},
author = {Vilhuber, Lars and Abowd, John M},
abstract = {The Synthetic Data Server (SDS) at Cornell University was set
up to provide early access to new synthetic data products by
the U.S. Census Bureau. These datasets are made available to
interested researchers in a controlled environment, prior to a
more generalized release. Over the past 5 years, 4 synthetic
datasets were made available on the server, and over 100 users
have accessed the server over that time period. This paper
reports on interim outcomes of the activity: results of
validation requests from a user perspective, functioning of the
feedback loop due to validation and user input, and the role of
the SDS as an access gateway to and educational tool for other
mechanisms of accessing detailed person, household,
establishment, and firm statistics.},
conference = {SOLE 2016},
institution = {NSF Census Research Network – NCRN-Cornell },
year = {2016},
number = {1813:43883},
URL = {http://hdl.handle.net/1813/43883}
}

Other recommended readings

L. Vilhuber, J. M. Abowd, and J. P. Reiter, “Synthetic establishment microdata around the world,” Statistical Journal of the International Association for Official Statistics, vol. 32, iss. 1, pp. 65-68, 2016. [Abstract] [DOI] [Bibtex]
In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

@article{VilhuberAbowdReiter:Synthetic:SJIAOS:2016,
title = {Synthetic establishment microdata around the world},
journal = {Statistical Journal of the International Association for Official Statistics},
author = {Lars Vilhuber and John M. Abowd and Jerome P. Reiter},
year=2016,
volume={32},
number={1},
pages={65-68},
doi={10.3233/SJI-160964},
abstract={In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.},
}

S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, “Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database,” International Statistical Review, vol. 79, iss. 3, p. 362–384, 2011. [Abstract] [DOI] [URL] [Bibtex]
In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

@ARTICLE{Kinney2011-ic,
title = {Towards Unrestricted Public Use Business Microdata: The
Synthetic Longitudinal Business Database},
author = {Kinney, Satkartar K and Reiter, Jerome P and Reznek, Arnold P
and Miranda, Javier and Jarmin, Ron S and Abowd, John M},
journal = {International Statistical Review},
year = {2011},
volume = {79},
pages = {362--384},
number = {3},
doi = {10.1111/j.1751-5823.2011.00153.x},
issn = {1751-5823},
keywords = {Economic census, data confidentiality, synthetic data, disclosure
limitation},
owner = {vilhuber},
publisher = {Blackwell Publishing Ltd},
timestamp = {2012.09.04},
abstract = {In most countries, national statistical agencies do not release establishment-level
business microdata, because doing so represents too large a risk
to establishments’ confidentiality. One approach with the potential
for overcoming these risks is to release synthetic data; that is,
the released establishment data are simulated from statistical models
designed to mimic the distributions of the underlying real microdata.
In this article, we describe an application of this strategy to create
a public use file for the Longitudinal Business Database, an annual
economic census of establishments in the United States comprising
more than 20 million records dating back to 1976. The U.S. Bureau
of the Census and the Internal Revenue Service recently approved
the release of these synthetic microdata for public use, making the
synthetic Longitudinal Business Database the first-ever business
microdata set publicly released in the United States. We describe
how we created the synthetic data, evaluated analytical validity,
and assessed disclosure risk.},
url = {http://dx.doi.org/10.1111/j.1751-5823.2011.00153.x}
}

J. Drechsler and L. Vilhuber, “A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database,” Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, 2014. [Abstract] [DOI] [URL] [Bibtex]
One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD – the German Longitudinal Business Database (GLBD) – that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.

@Article{SJIAOS-2014b,
Title = {{A First Step Towards A {German} {SynLBD}: {C}onstructing A {G}erman {L}ongitudinal {B}usiness {D}atabase}},
Author = {J{\"o}rg Drechsler and Lars Vilhuber},
Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
Year = {2014},
Volume = {30},
Abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD – the German Longitudinal Business Database (GLBD) – that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.},
DOI = {10.3233/SJI-140812},
Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
Owner = {vilhuber},
Timestamp = {2014.03.24},
URL = {http://iospress.metapress.com/content/X415V18331Q33150}
}

Funding
Funding for the workshop is provided by the National Science Foundation (CNS-1012593, SES-1131848) and the Alfred P. Sloan Foundation. The organizers thank the National Academies’ Committee on National Statistics for hosting the seminar.

Jun
8
Thu
2017
Workshop: Herramientas prácticas para la apertura y uso seguro de microdatos de firmas (Practical tools for the opening and secure use of firm microdata) @ Offices of CAF in Argentina
Jun 8 all-day

Vilhuber participates in a workshop with Latin American government and research analysts and data providers on the potential for secure use of firm microdata.

Jul
18
Tue
2017
Synthetic Datasets for Statistical Disclosure Control – Research and Applications Around the World @ Palais des Congrès - Mansouri Eddahbi
Jul 18 @ 10:30 – 12:30

Together with a few others from around the world, Lars Vilhuber will be presenting on results from a synthetic data validation cycle at the International Statistical Institute’s World Statistical Congress.