My publications can be found in a variety of places.
- Cornell eCommons
- ILR’s DigitalCommons
- GitHub here, here, and here
They are also listed, in curated form, in my bibliography. Below is a list of the latest documents on each of those sites (not all of the publications are mine).
Documents on Cornell eCommons
- Utility of two synthetic data sets mediated through a validation server: Experience with the Cornell Synthetic Data Server by Vilhuber, Lars on 13 August 2019 at 00:00
Vilhuber, Lars. The SDS at Cornell University was set up to provide early access to new synthetic data products by the U.S. Census Bureau. These datasets are made available to interested researchers in a controlled environment, prior to a more generalized release. Over the past 7 years, 4 synthetic datasets were made available on the server, and over 120 users have accessed the server over that time period. This paper reports on outcomes of the activity: results of validation requests from a user perspective, functioning of the feedback loop due to validation and user input, and the role of the SDS as an access gateway to and educational tool for other mechanisms of accessing detailed person, household, establishment, and firm statistics. Presentation made at the Conference on Current Trends in Survey Statistics 2019 at the Institute for Mathematical Sciences, National University of Singapore, Singapore, 13 – 16 August, 2019.
- Stepping-up: The Census Bureau Tries to Be a Good Data Steward in the 21st Century by Abowd, John M. on 4 March 2019 at 00:00
Abowd, John M. The Fundamental Law of Information Reconstruction, a.k.a. the Database Reconstruction Theorem, exposes a vulnerability in the way statistical agencies have traditionally published data. But it also exposes the same vulnerability for the way Amazon, Apple, Facebook, Google, Microsoft, Netflix, and other Internet giants publish data. We are all in this data-rich world together. And we all need to find solutions to the problem of how to publish information from these data while still providing meaningful privacy and confidentiality protections to the providers. Fortunately for the American public, the Census Bureau’s curation of their data is already regulated by a very strict law that mandates publication for statistical purposes only and in a manner that does not expose the data of any respondent (person, household, or business) in a way that identifies that respondent as the source of specific data items. The Census Bureau has consistently interpreted that stricture on publishing identifiable data as governed by the laws of probability. An external user of Census Bureau publications should not be able to assert with reasonable certainty that particular data values were directly supplied by an identified respondent. Traditional methods of disclosure avoidance now fail because they are not able to formalize and quantify that risk. Moreover, when traditional methods are assessed using current tools, the relative certainty with which specific values can be associated with identifiable individuals turns out to be orders of magnitude greater than anticipated at the time the data were released. In light of these developments, the Census Bureau has committed to an open and transparent modernization of its data publishing systems using formal methods like differential privacy.
The intention is to demonstrate that statistical data, fit for their intended uses, can be produced when the entire publication system is subject to a formal privacy-loss budget. To date, the team developing these systems, many of whom are in this room, has demonstrated that bounded ε-differential privacy can be implemented for the data publications from the 2020 Census used to re-draw every legislative district in the nation (PL94-171 tables). That team has also developed methods for quantifying and displaying the system-wide trade-offs between the accuracy of those data and the privacy-loss budget assigned to the tabulations. Considering that work began in mid-2016 and that no organization anywhere in the world has yet deployed a full, central differential privacy system, this is already a monumental achievement. But it is only the tip of the iceberg in terms of the statistical products historically produced from a decennial census. Demographic profiles, based on the detailed tables traditionally published in summary files following the publication of redistricting data, have far more diverse uses than the redistricting data. Summarizing those use cases in a set of queries that can be answered with a reasonable privacy-loss budget is the next challenge. Internet giants, businesses, and statistical agencies around the world should also step up to these challenges. We can learn from, and help, each other enormously. Presented at the Simons Institute Workshop "Data Privacy: From Foundations to Applications." Program available here: https://simons.berkeley.edu/workshops/schedule/6281
- Why the Economics Profession Cannot Cede the Discussion of Privacy Protection to Computer Scientists by Abowd, John M. on 5 January 2019 at 00:00
Abowd, John M.; Schmutte, Ian M.; Sexton, William N.; Vilhuber, Lars. When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, which is a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research. Presented at the Allied Social Science Association meeting 2019 in the session "The Future of Economic Research Under Rising Risks and Costs of Information Disclosure", Saturday, Jan. 5, 2019, 2:30 PM – 4:30 PM. https://www.aeaweb.org/conference/2019/preliminary/851
- The Reproducibility of Economics Research: A Case Study by Kingi, Hautahi on 10 December 2018 at 00:00
Kingi, Hautahi; Vilhuber, Lars; Herbert, Sylverie; Stanchi, Flavio. Published reproductions or replications of economics research are rare. However, recent years have seen increased recognition of the important role of replication in the scientific endeavor. We describe and present the results of a large reproduction exercise in which we assess the reproducibility of research articles published in the American Economic Journal: Applied Economics over the last decade. 69 of 162 eligible replication attempts (42.6%) successfully replicated the article’s analysis. A further 68 (42%) were at least partially successful. A total of 98 out of 303 (32.3%) relied on confidential or proprietary data, and were thus not reproducible by this project. We also conduct several bibliometric analyses of reproducible vs. non-reproducible articles.
- Reproducibility Confidentiality Data Access by Vilhuber, Lars on 1 November 2018 at 00:00
Vilhuber, Lars. The recent concern about the reproducibility of research results has not yet been robustly incorporated into methods of providing and accessing administrative data, casting doubts on the validity of research based on such data. Reproducibility depends on disaggregating and exposing the multiple components of the research (data, software, workflows, and provenance) to other researchers and providing adequate metadata to make these components usable. The key worry is access: the authors of a study that uses administrative data often cannot themselves deposit the data with the journal, thereby impairing easy access to those data and consequently impeding reproducibility. This suggests a critical role for administrative data centers. We argue that data held by ADRF do have attributes that lend themselves to reproducibility exercises, though this may, at present, not always be communicated correctly. We describe how ADRF can and should promote reproducibility through a number of components. Presented at the 2018 ADRF Network Research Conference.