Causally Motivated Sycophancy Mitigation for Large Language Models
Authors: Haoxi Li, Xueyang Tang, Jie Zhang, Song Guo, Sikai Bai, Peiran Dong, Yue Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted across diverse language tasks to demonstrate the superiority of our method over state-of-the-art competitors in mitigating sycophancy in LLMs. |
| Researcher Affiliation | Collaboration | 1The Hong Kong University of Science and Technology 2The Hong Kong Polytechnic University 3Peng Cheng Laboratory |
| Pseudocode | No | The paper describes its methodology using structured causal models and mathematical equations (e.g., objective functions), but it does not include any explicitly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. The text only refers to supplementary materials for detailed descriptions of datasets and baselines, but not code. |
| Open Datasets | Yes | Our primary evaluation suite is Sycophancy Eval, which extends existing assessments by incorporating realistic, open-ended text-generation tasks. This suite is based on the work of Sharma et al. (2024) and includes subsets of six QA datasets: (i) MMLU (Hendrycks et al., 2020); (ii) MATH (Hendrycks et al., 2021); (iii) AQuA (Ling et al., 2017); (iv) TruthfulQA (Lin et al., 2021); (v) TriviaQA (Joshi et al., 2017); and (vi) Poem (Sharma et al., 2024). |
| Dataset Splits | Yes | Specifically, we split TruthfulQA into halves: one for development (split 4:1 for training and validation) and the other for testing. |
| Hardware Specification | Yes | In addition, all experiments are implemented on four NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions the "LangChain library" in section A.2.3 but does not provide a specific version number for it or any other software dependency. |
| Experiment Setup | Yes | We perform three training epochs (2:1) alternately to update intervention prompts and heads weight matrix, and set their learning rates to 1e-5 and 2e-3, respectively. The total number of epochs is 40. ... We sweep two hyperparameters, K and λ, controlling the strength of calibration, using 5% of randomly sampled questions from TruthfulQA for training and validation. The optimal hyperparameters are K = 48 and λ = 0.1. |
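The dataset-split procedure quoted above (TruthfulQA halved into development and test, with the development half split 4:1 into train and validation) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, the seed, and the use of a plain shuffle are assumptions.

```python
import random

def split_truthfulqa(questions, seed=0):
    """Sketch of the described split: one half for development
    (further split 4:1 into train/val), the other half for test.
    The seed and shuffle strategy are hypothetical."""
    rng = random.Random(seed)
    items = list(questions)
    rng.shuffle(items)
    mid = len(items) // 2
    dev, test = items[:mid], items[mid:]
    cut = (len(dev) * 4) // 5  # 4:1 train/validation split
    return {"train": dev[:cut], "val": dev[cut:], "test": test}

splits = split_truthfulqa(range(100))
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))  # → 40 10 50
```

With 100 questions this yields 40 training, 10 validation, and 50 test examples, matching the halved-then-4:1 scheme the paper describes.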