Eliminating Position Bias of Language Models: A Mechanistic Approach

Authors: Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham Kakade, Hao Peng, Heng Ji

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments aim to show that PINE improves model performance across diverse tasks and outperforms other approaches. We select four tasks that pose position bias: LM-as-a-judge (Zheng et al., 2024b), which prompts LMs to select the better of two responses given a question; retrieval-augmented question-answering (Liu et al., 2024), which asks LMs to answer questions based on retrieved documents; molecule generation based on provided properties (Ramakrishnan et al., 2014); and math reasoning based on several given conditions (Chen et al., 2024b).
Researcher Affiliation | Academia | 1 University of Illinois Urbana-Champaign, 2 Harvard University, 3 Texas A&M University.
Pseudocode | No | The paper describes the method using textual explanations and figures (e.g., Figure 2) but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: Experiment details are described in Section 4.1 and Appendix E. Codes are uploaded to: https://github.com/wzq016/PINE.
Open Datasets | Yes | We benchmark our method on 23 datasets in the Reward Bench (Lambert et al., 2024b)... We follow the settings and use the prompts, data, and evaluation scripts of (Liu et al., 2024)... We train such an LM with the QM9 (Ramakrishnan et al., 2014) dataset. We use R-GSM (Chen et al., 2024a), a subset of GSM8K.
Dataset Splits | Yes | We use the official data split, prompts, and evaluation scripts to ensure reproducibility. We split the training dataset of QM9 into two subsets, each with 50k samples. We further clean this dataset to remove problems where conditions do not read smoothly after changing positions... yielding a small set containing 95 problems.
Hardware Specification | Yes | All experiments are launched on a single node of 8x A100 80G with SXM connection. 70B and 110B models are launched with 3x and 4x A100, respectively; other model sizes can be launched with 1x A100.
Software Dependencies | No | The paper mentions using "PyTorch (Ansel et al., 2024; Paszke et al., 2019)", "Transformers (Wolf et al., 2020)", and "vLLM (Kwon et al., 2023)" but does not specify version numbers for these software packages.
Experiment Setup | Yes | We follow previous work (Liu et al., 2024; Lambert et al., 2024a) and use temperature 0 to avoid variance. The LM is an 8-layer Llama model with 8 attention heads and 768 hidden dimensions.
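The QM9 split quoted in the Dataset Splits row (two disjoint 50k-sample subsets of the training set) can be sketched as follows. This is a minimal illustration, not the authors' actual script: the function name, the fixed seed, and the in-memory stand-in data are all assumptions.

```python
import random

def split_into_two_subsets(samples, subset_size=50_000, seed=0):
    """Shuffle sample indices and partition them into two disjoint subsets.

    Hypothetical helper illustrating the 50k/50k QM9 split described in
    the paper; the real preprocessing code may differ.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    first = [samples[i] for i in indices[:subset_size]]
    second = [samples[i] for i in indices[subset_size:2 * subset_size]]
    return first, second

# Stand-in for the ~100k QM9 training samples (placeholder integers).
data = list(range(100_000))
subset_a, subset_b = split_into_two_subsets(data)
```

The fixed seed matters here: without it, re-running the preprocessing would assign different samples to each subset and the downstream experiment would not be reproducible.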