reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control

Authors: Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, Martin J. Wainwright

NeurIPS 2017

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.
Researcher Affiliation	Academia	Fanny Yang (Dept. of EECS, U.C. Berkeley); Aaditya Ramdas (Dept. of EECS and Statistics, U.C. Berkeley); Kevin Jamieson (Allen School of CSE, U. of Washington); Martin Wainwright (Dept. of EECS and Statistics, U.C. Berkeley)
Pseudocode	Yes	Procedure 1 MAB-FDR Meta algorithm skeleton. Algorithm 1 Best-arm identification with a control arm for confidence δ and precision ϵ. Procedure 2 MAB-LORD: best-arm identification with online FDR control.
Open Source Code	Yes	The code for reproducing all experiments and plots in this paper is publicly available at https://github.com/fanny-yang/MABFDR
Open Datasets	Yes	Our experiments are run on artificial data with Gaussian/Bernoulli draws and real-world Bernoulli draws from the New Yorker Cartoon Caption contest. We have access to 1000 such contests over a period of 4 years.
Dataset Splits	Yes	In all simulations, 60% of all the hypotheses are true nulls, and their indices are chosen uniformly. The results in Section 4 are based on two different experimental settings: (i) an independent setting where we simulate K = 50 arms for each hypothesis, where we chose 60% of hypotheses to be true nulls and for the remaining 40% (non-nulls) we chose µ_i for the best alternative randomly in [0.05, 0.2] and other alternatives randomly in [0.0, 0.1]. (ii) a dependent setting (New Yorker data) where the alternatives are not chosen independently. For all results, we average over 100 repetitions.
Hardware Specification	No	The paper does not specify any hardware details like CPU models, GPU models, or memory specifications used for running experiments.
Software Dependencies	No	The paper does not specify any software dependencies with version numbers.
Experiment Setup	Yes	Unless otherwise noted, we set ϵ = 0 in all of our simulations to focus on the main ideas and keep the discussion concise. γ_j = 0.07 log(j ∨ 2) / (j e^{√(log j)}) as in [4]. (i) an independent setting where we simulate K = 50 arms for each hypothesis, where we chose 60% of hypotheses to be true nulls and for the remaining 40% (non-nulls) we chose µ_i for the best alternative randomly in [0.05, 0.2] and other alternatives randomly in [0.0, 0.1]. For all results, we average over 100 repetitions.
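
The independent simulation setting quoted in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code (which lives in the linked repository). Several details are assumptions not stated in the excerpts above: the γ_j sequence is read as 0.07 log(j ∨ 2) / (j e^{√(log j)}), the control arm has mean 0, every arm of a true-null hypothesis shares the control mean, the K = 50 arms exclude the control arm, and the quoted ranges are offsets from the control mean. The names `gamma`, `draw_means`, `num_hypotheses`, and `control_mean` are hypothetical.

```python
import math
import random


def gamma(j: int) -> float:
    """Assumed reading of the quoted sequence, defined for j >= 1:
    gamma_j = 0.07 * log(max(j, 2)) / (j * exp(sqrt(log j)))."""
    return 0.07 * math.log(max(j, 2)) / (j * math.exp(math.sqrt(math.log(j))))


def draw_means(num_hypotheses, num_arms=50, null_fraction=0.6,
               control_mean=0.0, rng=None):
    """Draw arm means for each hypothesis in the independent setting."""
    rng = rng or random.Random(0)
    # Indices of true nulls are chosen uniformly at random (60% by default).
    num_nulls = round(null_fraction * num_hypotheses)
    null_idx = set(rng.sample(range(num_hypotheses), num_nulls))

    experiments = []
    for h in range(num_hypotheses):
        if h in null_idx:
            # ASSUMPTION: under a true null, no arm differs from the control.
            means = [control_mean] * num_arms
        else:
            # Ordinary alternatives: offsets drawn uniformly from [0.0, 0.1].
            means = [control_mean + rng.uniform(0.0, 0.1)
                     for _ in range(num_arms)]
            # One designated arm gets an offset drawn from [0.05, 0.2].
            designated = rng.randrange(num_arms)
            means[designated] = control_mean + rng.uniform(0.05, 0.2)
        experiments.append({"is_null": h in null_idx,
                            "control_mean": control_mean,
                            "arm_means": means})
    return experiments


if __name__ == "__main__":
    exps = draw_means(num_hypotheses=100)
    print(sum(e["is_null"] for e in exps), "true nulls out of", len(exps))
    print("first gamma values:", [round(gamma(j), 4) for j in range(1, 4)])
```

Because the two quoted ranges overlap on [0.05, 0.1], the designated arm is not guaranteed to have the largest mean in this sketch. How the original experiments handle that case is not stated in the excerpts. Averaging over 100 repetitions, as the row describes, would correspond to calling `draw_means` with 100 different seeds.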