Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions

Authors: Charvi Rastogi, Sivaraman Balakrishnan, Nihar B. Shah, Aarti Singh

JMLR 2022

Reproducibility assessment (variable, result, and supporting LLM response):

Research Type: Experimental
"Furthermore, we empirically evaluate our results via extensive simulations as well as three real-world data sets consisting of pairwise-comparisons and rankings. By applying our two-sample test on real-world pairwise-comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently."

Researcher Affiliation: Academia
Charvi Rastogi (EMAIL), Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Sivaraman Balakrishnan (EMAIL), Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Nihar B. Shah (EMAIL), Machine Learning Department and Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Aarti Singh (EMAIL), Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Pseudocode: Yes
Algorithm 1: Two-sample test with pairwise comparisons for the model-free setting.
Algorithm 2: Permutation test with pairwise comparisons for the model-free setting.
Algorithm 3: Two-sample testing with partial ranking data for the Plackett-Luce model.
Algorithm 4: Two-sample testing with partial ranking data for the marginal-probability-based model.

Open Source Code: No
The paper does not provide a link to source code, nor does it state that code for the described methodology is released or available in supplementary materials. The license information provided applies to the paper itself, not to the code.

Open Datasets: Yes
"We use the data set from Shah et al. (2016) comprising six different experiments on the Amazon Mechanical Turk crowdsourcing platform." "For our experiments, we use the Sushi preference data set (Kamishima, 2003)."

Dataset Splits: No
The paper states: "We randomly sub-sampled n samples from each sub-group of subjects and used 200 permutations to determine the rejection threshold for the permutation test." This describes a sampling strategy for experimental evaluation but does not specify explicit training/validation/test splits in the standard sense (e.g., an 80/10/10 split or fixed set sizes).

Hardware Specification: No
The paper does not provide hardware details (e.g., CPU or GPU models, or memory specifications) used for running the simulations or experiments.

Software Dependencies: No
The paper mentions Wolfram Mathematica in Section 6.1.2 for evaluating one term, but it does not specify any software dependencies with version numbers for the implementation of its algorithms or experiments.

Experiment Setup: Yes
"In each of the simulations, we set the significance level to be 0.05." "...the threshold for the test is obtained by running the permutation test method over 5000 iterations." "...we randomly sub-sampled n samples from each sub-group of subjects and used 200 permutations to determine the rejection threshold for the permutation test."
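The quoted setup (significance level 0.05, rejection threshold from a permutation test) can be sketched as a generic two-sample permutation test. This is a minimal illustration, not the paper's algorithm: the test statistic below (absolute difference of sample means over encoded comparison outcomes) and all variable names are stand-ins chosen for clarity.

```python
import numpy as np

def permutation_two_sample_test(x, y, num_permutations=200, alpha=0.05, seed=0):
    """Generic two-sample permutation test (illustrative sketch).

    x, y: 1-D arrays of per-sample values, e.g. pairwise-comparison
    outcomes encoded as 0/1. The statistic here (absolute difference
    of means) is a simple stand-in for the paper's test statistic.
    """
    rng = np.random.default_rng(seed)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    n = len(x)
    count = 0
    for _ in range(num_permutations):
        # Randomly reassign the pooled samples to the two groups.
        perm = rng.permutation(pooled)
        stat = abs(perm[:n].mean() - perm[n:].mean())
        if stat >= observed:
            count += 1
    # Standard permutation p-value with add-one correction;
    # reject the null of identical distributions at level alpha.
    p_value = (count + 1) / (num_permutations + 1)
    return p_value, p_value <= alpha

# Example: two clearly different Bernoulli comparison distributions.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.9, size=300).astype(float)
y = rng.binomial(1, 0.1, size=300).astype(float)
p, reject = permutation_two_sample_test(x, y)
```

With 200 permutations, as in the quoted setup, the smallest attainable p-value is 1/201, so the test can still reject at the 0.05 level when the groups differ strongly.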