A Gold Standard Dataset for the Reviewer Assignment Problem
Authors: Ivan Stelmakh, John Wieting, Sarina Xi, Graham Neubig, Nihar B. Shah
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. We use this data to compare several popular algorithms currently employed in computer science conferences and come up with recommendations for stakeholders. |
| Researcher Affiliation | Academia | Ivan Stelmakh, John Wieting, Sarina Xi, Graham Neubig, and Nihar B. Shah Carnegie Mellon University. Corresponding author: EMAIL |
| Pseudocode | No | The paper describes the data collection process and the formal definition of the loss metric, and details of existing algorithms, but it does not present structured pseudocode or algorithm blocks for the methodology developed in this paper. |
| Open Source Code | Yes | First, we collect and release a high-quality dataset of reviewers' expertise that can be used for training and/or evaluation of similarity-computation algorithms. The dataset can be found on the project's GitHub page: https://github.com/niharshah/goldstandard-reviewer-paper-match. We use implementations of these methods that are available on the OpenReview GitHub page2 and execute them with default parameters. (footnote 2: https://github.com/openreview/openreview-expertise). The Association for Computational Linguistics (ACL) ... has its own method to compute expertise between papers and reviewers.3 (footnote 3: https://github.com/acl-org/reviewer-paper-matching) |
| Open Datasets | Yes | We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers... The dataset can be found on the project's GitHub page: https://github.com/niharshah/goldstandard-reviewer-paper-match |
| Dataset Splits | No | The paper does not provide traditional training/validation/test splits of its collected dataset. It defines 'Easy triplets' and 'Hard triplets' as subsets for stratified evaluation but these are not for model training. The models evaluated are either pre-trained or the paper doesn't specify how they would be trained on this dataset using such splits. |
| Hardware Specification | No | The paper mentions that the TPMS algorithm is 'fast to execute' but does not provide any specific details about the hardware (CPU, GPU models, memory, etc.) used for running the experiments or comparisons. |
| Software Dependencies | No | The paper mentions several algorithms and tools (ELMo, Specter, Specter2, TPMS, ACL algorithm, SentencePiece, SciBERT, o1-mini, Claude Sonnet 3.5, Gemini 2 Flash) but does not provide specific version numbers for the software dependencies used in their experimental setup. For instance, SentencePiece is mentioned, but its version is not specified. |
| Experiment Setup | Yes | In our experiments, we construct reviewer profiles automatically by using the 20 most recent papers from their Semantic Scholar profiles. If a reviewer has less than 20 papers published, we include all of them in their profile... To average this randomness out, we repeat the procedure of profile construction and similarity prediction 10 times, and report the mean loss over these iterations... Paper representation. We choose option (ii) [title and abstract] as this option is often used in real conferences and is supported by all algorithms we consider in this work. |
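The evaluation described above ranks algorithms by how often their predicted similarities disagree with reviewers' self-reported expertise orderings over paper pairs. The sketch below is a simplified, hypothetical illustration of such a pairwise ranking loss: the function name `pairwise_loss`, the dictionary-based data layout, and the unweighted (0/1) error count are our assumptions, not the paper's exact metric definition.

```python
from itertools import combinations

def pairwise_loss(reported, predicted):
    """Fraction of within-reviewer paper pairs whose predicted-similarity
    ordering disagrees with the reviewer's self-reported expertise ordering.

    Both arguments map (reviewer, paper) -> score. Pairs with tied reported
    scores carry no ordering information and are skipped.
    """
    # Group reported (paper, score) items by reviewer.
    by_reviewer = {}
    for (reviewer, paper), score in reported.items():
        by_reviewer.setdefault(reviewer, []).append((paper, score))

    errors, total = 0, 0
    for reviewer, items in by_reviewer.items():
        for (pa, sa), (pb, sb) in combinations(items, 2):
            if sa == sb:
                continue  # no strict reported ordering for this pair
            total += 1
            hi, lo = (pa, pb) if sa > sb else (pb, pa)
            # Error if the algorithm does not rank the higher-expertise
            # paper strictly above the lower-expertise one.
            if predicted[(reviewer, hi)] <= predicted[(reviewer, lo)]:
                errors += 1
    return errors / total if total else 0.0
```

Under the setup quoted above, one would recompute `predicted` for each of the 10 randomized profile constructions and report the mean of `pairwise_loss` over those iterations; the "Easy" and "Hard" triplet subsets correspond to restricting which pairs enter the loss.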