Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions

Authors: Charvi Rastogi, Sivaraman Balakrishnan, Nihar B. Shah, Aarti Singh

JMLR 2022

Reproducibility assessment (variable, result, and supporting LLM response):

Research Type: Experimental
"Furthermore, we empirically evaluate our results via extensive simulations as well as three real-world data sets consisting of pairwise-comparisons and rankings. By applying our two-sample test on real-world pairwise-comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently."

Researcher Affiliation: Academia
Charvi Rastogi (EMAIL), Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Sivaraman Balakrishnan (EMAIL), Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Nihar B. Shah (EMAIL), Machine Learning Department and Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Aarti Singh (EMAIL), Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Pseudocode: Yes
Algorithm 1: Two-sample test with pairwise comparisons for the model-free setting.
Algorithm 2: Permutation test with pairwise comparisons for the model-free setting.
Algorithm 3: Two-sample testing with partial ranking data for the Plackett-Luce model.
Algorithm 4: Two-sample testing with partial ranking data for the marginal-probability-based model.

Open Source Code: No
The paper does not provide a link to source code, nor does it state that code for the described methodology is released or available in supplementary materials. The license information provided applies to the paper itself, not to the code.

Open Datasets: Yes
"We use the data set from Shah et al. (2016) comprising six different experiments on the Amazon Mechanical Turk crowdsourcing platform." "For our experiments, we use the Sushi preference data set (Kamishima, 2003)."

Dataset Splits: No
The paper states: "We randomly sub-sampled n samples from each sub-group of subjects and used 200 permutations to determine the rejection threshold for the permutation test." This describes a sampling strategy for experimental evaluation but does not specify explicit training/validation/test splits in the standard sense (e.g., an 80/10/10 split or fixed set sizes).

Hardware Specification: No
The paper does not provide hardware details (e.g., CPU or GPU models, or memory specifications) used for running the simulations or experiments.

Software Dependencies: No
The paper mentions Wolfram Mathematica in Section 6.1.2 for evaluating one term, but it does not specify any software dependencies with version numbers for the implementation of its algorithms or experiments.

Experiment Setup: Yes
"In each of the simulations, we set the significance level to be 0.05." "...the threshold for the test is obtained by running the permutation test method over 5000 iterations." "...we randomly sub-sampled n samples from each sub-group of subjects and used 200 permutations to determine the rejection threshold for the permutation test."
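The quoted setup (significance level 0.05, rejection threshold from a permutation test) can be sketched as a generic two-sample permutation test. This is a minimal illustration, not the paper's algorithm: the test statistic below (absolute difference of sample means over encoded comparison outcomes) and all variable names are stand-ins chosen for clarity.

```python
import numpy as np

def permutation_two_sample_test(x, y, num_permutations=200, alpha=0.05, seed=0):
    """Generic two-sample permutation test (illustrative sketch).

    x, y: 1-D arrays of per-sample values, e.g. pairwise-comparison
    outcomes encoded as 0/1. The statistic here (absolute difference
    of means) is a simple stand-in for the paper's test statistic.
    """
    rng = np.random.default_rng(seed)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    n = len(x)
    count = 0
    for _ in range(num_permutations):
        # Randomly reassign the pooled samples to the two groups.
        perm = rng.permutation(pooled)
        stat = abs(perm[:n].mean() - perm[n:].mean())
        if stat >= observed:
            count += 1
    # Standard permutation p-value with add-one correction;
    # reject the null of identical distributions at level alpha.
    p_value = (count + 1) / (num_permutations + 1)
    return p_value, p_value <= alpha

# Example: two clearly different Bernoulli comparison distributions.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.9, size=300).astype(float)
y = rng.binomial(1, 0.1, size=300).astype(float)
p, reject = permutation_two_sample_test(x, y)
```

With 200 permutations, as in the quoted setup, the smallest attainable p-value is 1/201, so the test can still reject at the 0.05 level when the groups differ strongly.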