Preserving Statistical Validity in Adaptive Data Analysis

Dwork, Cynthia; Feldman, Vitaly; Hardt, Moritz; Pitassi, Toniann; Reingold, Omer; Roth, Aaron

arXiv:1411.2664 (cs)

[Submitted on 10 Nov 2014 (v1), last revised 2 Mar 2016 (this version, v3)]


Abstract: A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data are shared and reused, with hypotheses and new analyses generated on the basis of data exploration and the outcomes of previous analyses.
In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of $m$ adaptively chosen functions on an unknown distribution given $n$ random samples.
We show that, surprisingly, there is a way to accurately estimate a number of expectations that is exponential in $n$, even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators, which are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
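
To make the idea of "perturbing the estimates" concrete, here is a minimal sketch of answering adaptively chosen expectation queries with added noise. The abstract does not name a specific mechanism, so this assumes the Laplace mechanism, a standard tool from differential privacy. The function name `noisy_mean`, the privacy parameter `epsilon`, and the coin-flip data are all hypothetical and chosen only for illustration; this is not the paper's algorithm or its accuracy guarantee.

```python
import random

def noisy_mean(sample, phi, epsilon, rng):
    """Estimate E[phi(x)] from `sample`, perturbed with Laplace noise.

    Assumes phi maps each point into [0, 1] (an assumption of this sketch).
    """
    n = len(sample)
    empirical = sum(phi(x) for x in sample) / n
    # Replacing one point moves the empirical mean by at most 1/n, so
    # Laplace noise of scale 1/(n * epsilon) makes one answer epsilon-DP.
    scale = 1.0 / (n * epsilon)
    # A Laplace(0, scale) draw is the difference of two independent
    # exponential draws, each with mean `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return empirical + noise

rng = random.Random(0)
# Hypothetical data: 2000 points, each a tuple of 3 fair coin flips.
sample = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(2000)]

# First query: frequency of bit 0.
a0 = noisy_mean(sample, lambda x: x[0], epsilon=0.5, rng=rng)

# Adaptive second query: which function is asked depends on the first answer.
if a0 > 0.5:
    phi2 = lambda x: x[0] * x[1]
else:
    phi2 = lambda x: (1 - x[0]) * x[1]
a1 = noisy_mean(sample, phi2, epsilon=0.5, rng=rng)
```

The point of the sketch is the interaction pattern: the analyst only ever sees noisy answers, and later queries may depend on earlier ones. How to set the noise level across many queries so that all answers stay close to the true expectations is the subject of the paper itself.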

Comments:	Updated related work with recent developments
Subjects:	Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:1411.2664 [cs.LG]
	(or arXiv:1411.2664v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1411.2664

Submission history

From: Vitaly Feldman
[v1] Mon, 10 Nov 2014 23:44:49 UTC (36 KB)
[v2] Thu, 23 Apr 2015 20:57:38 UTC (38 KB)
[v3] Wed, 2 Mar 2016 07:04:07 UTC (39 KB)
