Eliminating Position Bias of Language Models: A Mechanistic Approach

Authors: Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham Kakade, Hao Peng, Heng Ji

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments aim to show that PINE improves model performance across diverse tasks and outperforms other approaches. We select four tasks that pose position bias: LM-as-a-judge (Zheng et al., 2024b), which prompts LMs to select the better of two responses given a question; retrieval-augmented question-answering (Liu et al., 2024), which asks LMs to answer questions based on retrieved documents; molecule generation based on provided properties (Ramakrishnan et al., 2014); and math reasoning based on several given conditions (Chen et al., 2024b).
Researcher Affiliation | Academia | 1 University of Illinois Urbana-Champaign, 2 Harvard University, 3 Texas A&M University.
Pseudocode | No | The paper describes the method using textual explanations and figures (e.g., Figure 2) but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code | Yes | REPRODUCIBILITY STATEMENT: Experiment details are described in Section 4.1 and Appendix E. Codes are uploaded to: https://github.com/wzq016/PINE.
Open Datasets | Yes | We benchmark our method on 23 datasets in the Reward Bench (Lambert et al., 2024b)... We follow the settings and use the prompts, data, and evaluation scripts of (Liu et al., 2024)... We train such an LM with the QM9 (Ramakrishnan et al., 2014) dataset. We use R-GSM (Chen et al., 2024a), a subset of GSM8K.
Dataset Splits | Yes | We use the official data split, prompts, and evaluation scripts to ensure reproducibility. We split the training dataset of QM9 into two subsets, each with 50k samples. We further clean this dataset to remove problems where conditions do not read smoothly after changing positions... yielding a small set containing 95 problems.
Hardware Specification | Yes | All experiments are launched on a single node of 8x A100 80G with SXM connection. 70B and 110B models are launched with 3x and 4x A100, respectively; other model sizes can be launched with 1x A100.
Software Dependencies | No | The paper mentions using "PyTorch (Ansel et al., 2024; Paszke et al., 2019)", "Transformers (Wolf et al., 2020)", and "vLLM (Kwon et al., 2023)" but does not specify version numbers for these software packages.
Experiment Setup | Yes | We follow previous work (Liu et al., 2024; Lambert et al., 2024a) and use temperature 0 to avoid variance. The LM is an 8-layer Llama model with 8 attention heads and 768 hidden dimensions.
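The QM9 split quoted in the Dataset Splits row (two disjoint 50k-sample subsets of the training set) can be sketched as follows. This is a minimal illustration, not the authors' actual script: the function name, the fixed seed, and the in-memory stand-in data are all assumptions.

```python
import random

def split_into_two_subsets(samples, subset_size=50_000, seed=0):
    """Shuffle sample indices and partition them into two disjoint subsets.

    Hypothetical helper illustrating the 50k/50k QM9 split described in
    the paper; the real preprocessing code may differ.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    first = [samples[i] for i in indices[:subset_size]]
    second = [samples[i] for i in indices[subset_size:2 * subset_size]]
    return first, second

# Stand-in for the ~100k QM9 training samples (placeholder integers).
data = list(range(100_000))
subset_a, subset_b = split_into_two_subsets(data)
```

The fixed seed matters here: without it, re-running the preprocessing would assign different samples to each subset and the downstream experiment would not be reproducible.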