My publications can be found in a variety of places and are, of course, listed (curated) in my bibliography. Below is a list of the latest documents on each of those sites (not all the publications are mine).
Documents on Cornell eCommons
Utility of two synthetic data sets mediated through a validation server: Experience with the Cornell Synthetic Data Server
by Vilhuber, Lars on 13 August 2019 at 00:00
The SDS at Cornell University was set up to provide early access to new synthetic data products by the U.S. Census Bureau. These datasets are made available to interested researchers in a controlled environment, prior to a more generalized release. Over the past 7 years, 4 synthetic datasets were made available on the server, and over 120 users have accessed the server over that time period. This paper reports on outcomes of the activity: results of validation requests from a user perspective, functioning of the feedback loop due to validation and user input, and the role of the SDS as an access gateway to and educational tool for other mechanisms of accessing detailed person, household, establishment, and firm statistics. Presentation made at the Conference on Current Trends in Survey Statistics 2019 at the Institute for Mathematical Sciences, National University of Singapore, Singapore, 13 – 16 August, 2019.
Stepping-up: The Census Bureau Tries to Be a Good Data Steward in the 21st Century
by Abowd, John M. on 4 March 2019 at 00:00
The Fundamental Law of Information Reconstruction, a.k.a. the Database Reconstruction Theorem, exposes a vulnerability in the way statistical agencies have traditionally published data. But it also exposes the same vulnerability for the way Amazon, Apple, Facebook, Google, Microsoft, Netflix, and other Internet giants publish data. We are all in this data-rich world together. And we all need to find solutions to the problem of how to publish information from these data while still providing meaningful privacy and confidentiality protections to the providers. Fortunately for the American public, the Census Bureau’s curation of their data is already regulated by a very strict law that mandates publication for statistical purposes only and in a manner that does not expose the data of any respondent (person, household or business) in a way that identifies that respondent as the source of specific data items. The Census Bureau has consistently interpreted that stricture on publishing identifiable data as governed by the laws of probability. An external user of Census Bureau publications should not be able to assert with reasonable certainty that particular data values were directly supplied by an identified respondent. Traditional methods of disclosure avoidance now fail because they are not able to formalize and quantify that risk. Moreover, when traditional methods are assessed using current tools, the relative certainty with which specific values can be associated with identifiable individuals turns out to be orders of magnitude greater than anticipated at the time the data were released. In light of these developments, the Census Bureau has committed to an open and transparent modernization of its data publishing systems using formal methods like differential privacy.
The intention is to demonstrate that statistical data, fit for their intended uses, can be produced when the entire publication system is subject to a formal privacy-loss budget. To date, the team developing these systems (many of whom are in this room) has demonstrated that bounded ε-differential privacy can be implemented for the data publications from the 2020 Census used to re-draw every legislative district in the nation (PL94-171 tables). That team has also developed methods for quantifying and displaying the system-wide trade-offs between the accuracy of those data and the privacy-loss budget assigned to the tabulations. Considering that work began in mid-2016 and that no organization anywhere in the world has yet deployed a full, central differential privacy system, this is already a monumental achievement. But it is only the tip of the iceberg in terms of the statistical products historically produced from a decennial census. Demographic profiles, based on the detailed tables traditionally published in summary files following the publication of redistricting data, have far more diverse uses than the redistricting data. Summarizing those use cases in a set of queries that can be answered with a reasonable privacy-loss budget is the next challenge. Internet giants, businesses and statistical agencies around the world should also step up to these challenges. We can learn from, and help, each other enormously. Presented at the Simons Institute Workshop "Data Privacy: From Foundations to Applications." Program available here: https://simons.berkeley.edu/workshops/schedule/6281
Why the Economics Profession Cannot Cede the Discussion of Privacy Protection to Computer Scientists
by Abowd, John M. on 5 January 2019 at 00:00
Authors: Abowd, John M.; Schmutte, Ian M.; Sexton, William N.; Vilhuber, Lars. When Google or the U.S. Census Bureau publishes detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, which is a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research. Presented at the Allied Social Science Associations meeting 2019 in the session "The Future of Economic Research Under Rising Risks and Costs of Information Disclosure", Saturday, Jan. 5, 2019, 2:30 PM – 4:30 PM (https://www.aeaweb.org/conference/2019/preliminary/851).
The Reproducibility of Economics Research: A Case Study
by Kingi, Hautahi on 10 December 2018 at 00:00
Authors: Kingi, Hautahi; Vilhuber, Lars; Herbert, Sylverie; Stanchi, Flavio. Published reproductions or replications of economics research are rare. However, recent years have seen increased recognition of the important role of replication in the scientific endeavor. We describe and present the results of a large reproduction exercise in which we assess the reproducibility of research articles published in the American Economic Journal: Applied Economics over the last decade. 69 of 162 eligible replication attempts (42.6%) successfully replicated the article’s analysis. A further 68 (42%) were at least partially successful. A total of 98 out of 303 (32.3%) relied on confidential or proprietary data, and were thus not reproducible by this project. We also conduct several bibliometric analyses of reproducible vs. non-reproducible articles.
Reproducibility Confidentiality Data Access
by Vilhuber, Lars on 1 November 2018 at 00:00
The recent concern about the reproducibility of research results has not yet been robustly incorporated into methods of providing and accessing administrative data, casting doubts on the validity of research based on such data. Reproducibility depends on disaggregating and exposing the multiple components of the research (data, software, workflows, and provenance) to other researchers and providing adequate metadata to make these components usable. The key worry is access: the authors of a study that uses administrative data often cannot themselves deposit the data with the journal, thereby impairing easy access to those data and consequently impeding reproducibility. This suggests a critical role for administrative data centers. We argue that data held by ADRF do have attributes that lend themselves to reproducibility exercises, though this may, at present, not always be communicated correctly. We describe how ADRF can and should promote reproducibility through a number of components. Presented at the 2018 ADRF Network Research Conference.
Documents on DigitalCommons
Recent documents in Labor Dynamics Institute
How Protective Are Synthetic Data?
by John Abowd et al. on 13 June 2019 at 17:37
This short paper provides a synthesis of the statistical disclosure limitation (SDL) and computer science data privacy approaches to measuring the confidentiality protections provided by fully synthetic data. Since all elements of the data records in the release file derived from fully synthetic data are sampled from an appropriate probability distribution, they do not represent “real data,” but there is still a disclosure risk. In SDL this risk is summarized by the inferential disclosure probability. In privacy-protected database queries, this risk is measured by the differential privacy ratio. The two are closely related. This result (not new) is demonstrated and examples are provided from recent work.
metajelo: A Metadata Package for Journals to Support External Linked Objects
by Carl Lagoze et al. on 11 April 2019 at 13:16
We propose a metadata package that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. As such, the vocabulary described here complements existing initiatives that specify vocabularies to describe the supplementary materials or the repositories and archives in which they have been deposited. Where possible, it reuses elements of relevant other vocabularies, facilitating coexistence with them. Furthermore, it provides an “at publication” record of reproducibility characteristics of a particular article that has been selected for publication. The proposed metadata package documents the key characteristics that journals care about in the case of supplementary materials that are held by third parties: existence, accessibility, and permanence. It does so in a robust, time-invariant fashion at the time of publication, when the editorial decisions are made. It also allows for better documentation of less accessible (non-public) data, by treating them symmetrically from the point of view of the journal, thereby increasing the transparency of what up until now has been very opaque.
Why the Economics Profession Must Actively Participate in the Privacy Protection Debate
by John M. Abowd et al. on 5 March 2019 at 15:47
When Google or the U.S. Census Bureau publishes detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, which is a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.
Understanding Database Reconstruction Attacks on Public Data
by Simson L. Garfinkel et al. on 26 November 2018 at 16:26
In 2020 the U.S. Census Bureau will conduct the Constitutionally mandated decennial Census of Population and Housing. Because a census involves collecting large amounts of private data under the promise of confidentiality, traditionally statistics are published only at high levels of aggregation. Published statistical tables are vulnerable to DRAs (database reconstruction attacks), in which the underlying microdata is recovered merely by finding a set of microdata that is consistent with the published statistical tabulations. A DRA can be performed by using the tables to create a set of mathematical constraints and then solving the resulting set of simultaneous equations. This article shows how such an attack can be addressed by adding noise to the published tabulations, so that the reconstruction no longer results in the original data.
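As a toy illustration of the attack this abstract describes (not code from the article), the sketch below treats a few published statistics as constraints and searches for every set of records consistent with them. The block of three people, their ages, the age domain `AGES`, and the helper names `consistent_records` and `published_tables` are all assumptions made up for this example; a real attack would hand the constraints to a solver rather than enumerate by brute force.

```python
from itertools import combinations_with_replacement
from statistics import median

AGES = range(0, 100)  # assumed age domain for this toy example


def consistent_records(count, constraints):
    """Enumerate sorted age tuples that satisfy every published constraint."""
    return [
        rec
        for rec in combinations_with_replacement(AGES, count)
        if all(check(rec) for check in constraints)
    ]


def published_tables(total_age, median_age, youngest):
    """Turn three hypothetical published statistics into constraints."""
    return [
        lambda rec: sum(rec) == total_age,
        lambda rec: median(rec) == median_age,
        lambda rec: min(rec) == youngest,
    ]


# Hypothetical block of three people whose true ages are (8, 30, 94).
exact = published_tables(total_age=132, median_age=30, youngest=8)

# Two tables alone leave many candidate data sets ...
loose = consistent_records(3, exact[:2])
# ... but adding a third pins down the true records.
recovered = consistent_records(3, exact)

# Perturbing one published total (132 -> 135) still yields a solution,
# but it is no longer the original data.
noisy = consistent_records(3, published_tables(135, 30, 8))

print(len(loose), recovered, noisy)
```

With only the total and the median, 28 candidate blocks remain; the third statistic narrows them to the true one, while the perturbed total leads the same search to a different, incorrect record set.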
The U.S. Census Bureau Adopts Differential Privacy
by John M. Abowd on 21 November 2018 at 18:19
The U.S. Census Bureau announced, via its Scientific Advisory Committee, that it would protect the publications of the 2018 End-to-End Census Test (E2E) using differential privacy. The E2E test is a dress rehearsal for the 2020 Census, the constitutionally mandated enumeration of the population used to reapportion the House of Representatives and redraw every legislative district in the country. Systems that perform successfully in the E2E test are then used in the production of the 2020 Census. Motivation: The Census Bureau conducted internal research that confirmed that the statistical disclosure limitation systems used for the 2000 and 2010 Censuses had serious vulnerabilities that were exposed by the Dinur and Nissim (2003) database reconstruction theorem. We designed a differentially private publication system that directly addressed these vulnerabilities while preserving the fitness for use of the core statistical products. Problem statement: Designing and engineering production differential privacy systems requires two primary components: (1) inventing and constructing algorithms that deliver maximum accuracy for a given privacy-loss budget and (2) ensuring that the privacy-loss budget can be directly controlled by the policy-makers who must choose an appropriate point on the accuracy-privacy-loss tradeoff. The first problem lies in the domain of computer science. The second lies in the domain of economics. Approach: The algorithms under development for the 2020 Census focus on the data used to draw legislative districts and to enforce the 1965 Voting Rights Act (VRA). These algorithms efficiently distribute the noise injected by differential privacy. The Data Stewardship Executive Policy Committee selects the privacy-loss parameter after reviewing accuracy-privacy-loss graphs.