Design and Analysis of the NIPS 2016 Review Process

Authors: Nihar B. Shah, Behzad Tabibian, Krikamol Muandet, Isabelle Guyon, Ulrike von Luxburg

JMLR 2018

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "In this paper, we analyze several aspects of the data collected during the review process, including an experiment investigating the efficacy of collecting ordinal rankings from reviewers. We make a number of key observations, provide suggestions that may be useful for subsequent conferences, and discuss open problems towards the goal of improving peer review."
Researcher Affiliation: Academia. Nihar B. Shah (EMAIL), Machine Learning Department and Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Behzad Tabibian (EMAIL), Max Planck Institute for Intelligent Systems and Max Planck Institute for Software Systems, Tübingen, Germany; Krikamol Muandet (EMAIL), Max Planck Institute for Intelligent Systems, Tübingen, Germany; Isabelle Guyon (EMAIL), Université Paris-Saclay, France, and ChaLearn, California; Ulrike von Luxburg (EMAIL), University of Tübingen and Max Planck Institute for Intelligent Systems, Tübingen, Germany.
Pseudocode: No. The paper describes procedures and methods in paragraph text and numbered steps (e.g., for the messy middle model), but does not contain any clearly labeled pseudocode or algorithm blocks with structured formatting.
Open Source Code: No. The paper mentions that authors Behzad Tabibian and Krikamol Muandet "were also the workflow team of NIPS 2016 and were responsible for all the programs, scripts and CMT-related issues during the review process." However, there is no explicit statement about making their code or scripts open source for this paper's analysis or method, and no links or repositories are provided.
Open Datasets: No. The paper analyzes "the data collected during the review process" of NIPS 2016 and NIPS 2015. This data appears to be proprietary to the NIPS conference organizers, and there is no indication that it is publicly available. No links, DOIs, or citations to public datasets are provided for the data analyzed in the paper.
Dataset Splits: Yes. "Wherever applicable, we also perform our analyses on a subset of the submitted papers which we term as the top 2k papers. The top 2k papers comprise all of the 568 accepted papers, and an equal number (568) of the rejected papers. The 568 rejected papers are chosen as those with the maximum mean score (where the mean for any paper is taken across all criteria and all reviewers)."
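The top-2k selection described above can be sketched in a few lines: take all accepted papers, then add an equal number of rejected papers with the highest mean scores. The data below is synthetic and the acceptance rule is a toy stand-in; only the selection logic mirrors the paper's description.

```python
import numpy as np

# Hypothetical inputs: per-paper mean review scores and accept decisions
# (synthetic stand-ins; the real NIPS 2016 data is not public).
rng = np.random.default_rng(0)
n_papers = 2425
mean_score = rng.uniform(2.0, 8.0, size=n_papers)
accepted = mean_score + rng.normal(0.0, 1.0, n_papers) > 6.5  # toy decision rule

accepted_idx = np.flatnonzero(accepted)
rejected_idx = np.flatnonzero(~accepted)

# Rejected papers with the highest mean scores, as many as were accepted.
k = accepted_idx.size
order = np.argsort(mean_score[rejected_idx])[::-1]
top_rejected = rejected_idx[order[:k]]

# The "top 2k" subset: all accepted plus the best-scoring rejected papers.
top_2k = np.concatenate([accepted_idx, top_rejected])
```

With the paper's numbers (568 accepted), this yields a balanced 1136-paper subset.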
Hardware Specification: No. The paper analyzes data from the NIPS 2016 review process using statistical methods. It does not describe any computational experiments that would require specific hardware, nor does it mention any hardware specifications for the analysis performed.
Software Dependencies: No. The paper mentions the "Toronto paper matching system or TPMS" and "CMT-related issues" as part of the NIPS 2016 review process, but it does not specify any software or library dependencies with version numbers used for the statistical analysis presented in the paper.
Experiment Setup: Yes. "All t-tests conducted correspond to two-sample t-tests with unequal variances. All mentions of p-values correspond to two-sided tail probabilities. All mentions of statistical significance correspond to a p-value threshold of 0.01 (we also provide the exact p-values alongside). Multiple testing is accounted for using the Bonferroni correction. The effect sizes refer to Cohen's d. Wherever applicable, the error bars in the figures represent 95% confidence intervals."
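The statistical recipe quoted above (Welch's two-sample t-test, two-sided p-values, a Bonferroni-corrected 0.01 threshold, and Cohen's d) can be sketched as follows. The two score samples are synthetic placeholders; only the test procedure follows the paper's stated setup.

```python
import numpy as np
from scipy import stats

# Synthetic review-score samples for two groups of papers (assumed data).
rng = np.random.default_rng(0)
group_a = rng.normal(5.5, 1.0, size=200)
group_b = rng.normal(5.2, 1.2, size=200)

# Two-sample t-test with unequal variances (Welch's t-test), two-sided p-value.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Bonferroni correction: with m simultaneous tests, compare each p-value
# against alpha / m for a family-wise threshold of alpha = 0.01.
m = 10
significant = p_value < 0.01 / m

# Cohen's d, using the average of the two sample variances as the pooled SD.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2.0)
cohens_d = (group_a.mean() - group_b.mean()) / pooled_sd
```

`equal_var=False` is what makes `scipy.stats.ttest_ind` compute Welch's test rather than the pooled-variance Student's t-test.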