A Gold Standard Dataset for the Reviewer Assignment Problem
Authors: Ivan Stelmakh, John Wieting, Sarina Xi, Graham Neubig, Nihar B. Shah
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. We use this data to compare several popular algorithms currently employed in computer science conferences and come up with recommendations for stakeholders. |
| Researcher Affiliation | Academia | Ivan Stelmakh, John Wieting, Sarina Xi, Graham Neubig, and Nihar B. Shah Carnegie Mellon University. Corresponding author: EMAIL |
| Pseudocode | No | The paper describes the data collection process and the formal definition of the loss metric, and details of existing algorithms, but it does not present structured pseudocode or algorithm blocks for the methodology developed in this paper. |
| Open Source Code | Yes | First, we collect and release a high-quality dataset of reviewers' expertise that can be used for training and/or evaluation of similarity-computation algorithms. The dataset can be found on the project's GitHub page: https://github.com/niharshah/goldstandard-reviewer-paper-match. We use implementations of these methods that are available on the OpenReview GitHub page2 and execute them with default parameters. (footnote 2: https://github.com/openreview/openreview-expertise). The Association for Computational Linguistics (ACL) ... has its own method to compute expertise between papers and reviewers.3 (footnote 3: https://github.com/acl-org/reviewer-paper-matching) |
| Open Datasets | Yes | We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers... The dataset can be found on the project's GitHub page: https://github.com/niharshah/goldstandard-reviewer-paper-match |
| Dataset Splits | No | The paper does not provide traditional training/validation/test splits of its collected dataset. It defines 'Easy triplets' and 'Hard triplets' as subsets for stratified evaluation but these are not for model training. The models evaluated are either pre-trained or the paper doesn't specify how they would be trained on this dataset using such splits. |
| Hardware Specification | No | The paper mentions that the TPMS algorithm is 'fast to execute' but does not provide any specific details about the hardware (CPU, GPU models, memory, etc.) used for running the experiments or comparisons. |
| Software Dependencies | No | The paper mentions several algorithms and tools (ELMo, Specter, Specter2, TPMS, ACL algorithm, SentencePiece, SciBERT, o1-mini, Claude Sonnet 3.5, Gemini 2 Flash) but does not provide specific version numbers for the software dependencies used in their experimental setup. For instance, SentencePiece is mentioned, but its version is not specified. |
| Experiment Setup | Yes | In our experiments, we construct reviewer profiles automatically by using the 20 most recent papers from their Semantic Scholar profiles. If a reviewer has less than 20 papers published, we include all of them in their profile... To average this randomness out, we repeat the procedure of profile construction and similarity prediction 10 times, and report the mean loss over these iterations... Paper representation. We choose option (ii) [title and abstract] as this option is often used in real conferences and is supported by all algorithms we consider in this work. |
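The evaluation described above ranks algorithms by how often their predicted similarities disagree with reviewers' self-reported expertise orderings over paper pairs. The sketch below is a simplified, hypothetical illustration of such a pairwise ranking loss: the function name `pairwise_loss`, the dictionary-based data layout, and the unweighted (0/1) error count are our assumptions, not the paper's exact metric definition.

```python
from itertools import combinations

def pairwise_loss(reported, predicted):
    """Fraction of within-reviewer paper pairs whose predicted-similarity
    ordering disagrees with the reviewer's self-reported expertise ordering.

    Both arguments map (reviewer, paper) -> score. Pairs with tied reported
    scores carry no ordering information and are skipped.
    """
    # Group reported (paper, score) items by reviewer.
    by_reviewer = {}
    for (reviewer, paper), score in reported.items():
        by_reviewer.setdefault(reviewer, []).append((paper, score))

    errors, total = 0, 0
    for reviewer, items in by_reviewer.items():
        for (pa, sa), (pb, sb) in combinations(items, 2):
            if sa == sb:
                continue  # no strict reported ordering for this pair
            total += 1
            hi, lo = (pa, pb) if sa > sb else (pb, pa)
            # Error if the algorithm does not rank the higher-expertise
            # paper strictly above the lower-expertise one.
            if predicted[(reviewer, hi)] <= predicted[(reviewer, lo)]:
                errors += 1
    return errors / total if total else 0.0
```

Under the setup quoted above, one would recompute `predicted` for each of the 10 randomized profile constructions and report the mean of `pairwise_loss` over those iterations; the "Easy" and "Hard" triplet subsets correspond to restricting which pairs enter the loss.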