Great Models Think Alike and this Undermines AI Oversight

Authors: Shashwat Goel, Joschka Strüber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping

ICML 2025

Reproducibility assessment. Each entry below gives the variable, the result, and the supporting LLM response.
Research Type: Experimental
We study how model similarity affects both aspects of AI oversight by proposing Chance Adjusted Probabilistic Agreement (CAPA): a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from weak-to-strong generalization. Overall, our work proposes a novel probabilistic metric for model similarity, and demonstrates the risks of correlated mistakes in the emerging paradigm of AI oversight.
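The response above describes CAPA as a chance-adjusted measure of overlap in model mistakes. The paper's exact probabilistic formulation is not reproduced in this report; as a rough illustration only, the sketch below computes a kappa-style chance-adjusted agreement over hard per-sample correctness vectors. The function name and the independent-errors chance model are assumptions here, and the real CAPA additionally uses model output probabilities rather than binary correctness.

```python
import numpy as np

def chance_adjusted_agreement(correct_a, correct_b):
    """Kappa-style chance-adjusted agreement between two models'
    per-sample correctness vectors (1 = correct, 0 = wrong).

    NOTE: a simplified sketch, not the paper's exact CAPA metric.
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    p_obs = np.mean(a == b)            # observed sample-wise agreement
    acc_a, acc_b = a.mean(), b.mean()
    # agreement expected by chance if the models erred independently,
    # given their accuracies
    p_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    return (p_obs - p_exp) / (1 - p_exp)
```

Two models with identical correctness patterns score 1.0, while models whose errors co-occur no more than chance predicts score near 0, which is the intuition behind adjusting raw agreement for accuracy.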
Researcher Affiliation: Collaboration
1. ELLIS Institute Tübingen; 2. Max Planck Institute for Intelligent Systems; 3. Tübingen AI Center; 4. University of Tübingen; 5. IIIT Hyderabad; 6. Contextual AI; 7. Stanford University
Pseudocode: No
The paper describes its methodology using mathematical definitions and textual explanations in Section 2 and Appendix A, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No
The paper does not provide a specific repository link or an explicit statement confirming the release of its implementation code. It only mentions 'model-similarity.github.io lm-similarity', which appears to be a project website, and expresses a general hope for community code release: 'We hope the community shifts towards releasing sample-wise model predictions alongside benchmark scores (Burnell et al., 2023; Ghosh et al., 2024), as they enable richer analysis like measuring similarity.'
Open Datasets: Yes
We collect sample-wise evaluation files for 130 official models from the Open LLM Leaderboard 2 released by Hugging Face, listed in Appendix D.5. We use MMLU-Pro (Wang et al., 2024) and Big Bench Hard (BBH) (Suzgun et al., 2023) as they measure a broad range of capabilities using MCQ, and frontier models have reasonable accuracies while not saturating these datasets.
Dataset Splits: Yes
The setup uses a pretrained weak base model W, a pretrained strong base model S, and a dataset D, where Dtr, Dval, and Dte are the training (10,000 samples), validation (1,000 samples), and test (5,000 samples) data splits respectively. Dtr is divided into two halves by independently assigning each sample to Dtr1 or Dtr2 with 50% probability.
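The split construction described above can be sketched as follows. Only the split sizes and the independent 50/50 assignment of Dtr come from the text; the shuffling, seed handling, and function name are assumptions for illustration.

```python
import random

def make_splits(dataset, seed=0):
    """Sketch of the split scheme described in the paper:
    10k train / 1k val / 5k test, with the train set divided into
    two halves by independent coin flips per sample.
    The shuffle and seed are assumptions, not from the paper."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    d_tr = data[:10_000]
    d_val = data[10_000:11_000]
    d_te = data[11_000:16_000]
    # independently assign each training sample to one of two halves
    d_tr1, d_tr2 = [], []
    for sample in d_tr:
        (d_tr1 if rng.random() < 0.5 else d_tr2).append(sample)
    return d_tr1, d_tr2, d_val, d_te
```

Note that the coin-flip assignment makes the two training halves disjoint but only approximately equal in size, matching the "50% probability each" description rather than an exact 5,000/5,000 split.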
Hardware Specification: No
The paper mentions using 'vLLM as the backend for the LM Eval Harness (Kwon et al., 2023)' in Appendix B.8, but it does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments.
Software Dependencies: No
The paper mentions using 'LM Evaluation Harness (Gao et al., 2023)' and 'vLLM (Kwon et al., 2023)' as tools and 'Low Rank Adapters (LoRA) (Hu et al., 2022)' as a finetuning technique, but it does not provide specific version numbers for any software libraries, frameworks, or programming languages.
Experiment Setup: Yes
Following Scherlis et al. (2024), we use a cosine learning rate schedule with 40 warmup steps; the learning rates for the weak and strong models are 5e-4 and 8e-5 respectively, and we train for 3 epochs, which is sufficient for the train and validation losses to stabilize.
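The schedule described above can be sketched in plain Python. Only the 40 warmup steps and the peak learning rates (5e-4 weak, 8e-5 strong) come from the text; the linear warmup shape and the decay to zero are common defaults assumed here.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps=40):
    """Cosine learning-rate schedule with linear warmup.

    Sketch of the setup described in the paper (40 warmup steps;
    base_lr = 5e-4 for the weak model, 8e-5 for the strong model).
    Linear warmup and a zero final LR are assumptions here."""
    if step < warmup_steps:
        # linear warmup from ~0 up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

In practice an equivalent schedule would typically be built with a framework scheduler (e.g. a cosine-with-warmup helper), but the closed form above shows the shape: the LR peaks at `base_lr` right after warmup and falls to zero at the final step.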