Correlated Errors in Large Language Models

Authors: Elliot Myunghoon Kim, Avi Garg, Kenny Peng, Nikhil Garg

ICML 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | We conduct a large-scale empirical evaluation of over 350 LLMs, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors: on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers.
Researcher Affiliation | Collaboration | ¹Cornell University, ²Independent. Correspondence to: Kenny Peng <EMAIL>, Nikhil Garg <EMAIL>.
Pseudocode | No | The paper describes its methodology in narrative text and refers to specific prompts in Figure 7, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/nikhgarg/llm_correlated_errors_public/.
Open Datasets | Yes | We started from two LLM leaderboards: (1) Hugging Face’s Open LLM Leaderboard; and (2) Stanford’s Holistic Evaluation of Language Models (HELM) (Liang et al., 2023)... Starting from large datasets of job postings (Asaniczka, 2024) and resumes (Bhawal, 2022; Jiechieu & Tsopze, 2021).
Dataset Splits | Yes | We hand-label 450 resume-job pairs (30 unique resumes and 15 job descriptions) using the same criteria as our prompts.
Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models or memory) used for its experiments or analysis. It mentions API credits from Meta and Amazon, implying the use of cloud services to access LLMs, but gives no details about the local hardware used for the authors' own analysis.
Software Dependencies | No | The paper mentions 'Sentence Transformers (SBERT)' in Appendix A.2, but does not provide a specific version number for this or any other software dependency.
Experiment Setup | Yes | In our experiments, we set p = 0.25, so the top quarter of applicants receive interviews at each firm... Each firm has capacity of 1: each applicant accepts at most one job offer, and each firm can hire at most one applicant. For all experiments in this section, each applicant a ∈ A has uniformly random preferences over firms.
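The headline statistic quoted above — models agreeing 60% of the time when both err — is a conditional agreement rate. A minimal sketch of how that quantity can be computed, using made-up toy data rather than anything from the paper:

```python
# Sketch: agreement between two models' errors, conditional on both erring.
# All data here is synthetic and illustrative, not from the paper.

def error_agreement(answers_a, answers_b, gold):
    """Among questions where both models are wrong, return the fraction
    where they give the same (wrong) answer."""
    both_wrong = [
        (a, b)
        for a, b, g in zip(answers_a, answers_b, gold)
        if a != g and b != g
    ]
    if not both_wrong:
        return float("nan")
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

# Toy example: 5 multiple-choice questions with gold answers and two models.
gold    = ["A", "B", "C", "D", "A"]
model_1 = ["A", "C", "D", "D", "B"]  # wrong on Q2, Q3, Q5
model_2 = ["B", "C", "A", "D", "B"]  # wrong on Q1, Q2, Q3, Q5
# Both wrong on Q2 (agree), Q3 (disagree), Q5 (agree) -> 2/3.
print(error_agreement(model_1, model_2, gold))  # → 0.6666666666666666
```

Conditioning on both models erring is what separates genuinely correlated mistakes from agreement that arises merely because both models are usually correct.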
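The Experiment Setup row quotes a stylized hiring market: each firm interviews its top p = 0.25 fraction of applicants, capacities are 1, and applicants have uniformly random preferences over firms. A minimal sketch of that setup, with hypothetical screening scores standing in for LLM evaluations and a simple one-round greedy acceptance step (the paper's actual matching mechanism may differ):

```python
# Sketch of the quoted market setup. Market sizes, the screening scores,
# and the greedy acceptance rule are all illustrative assumptions.
import random

random.seed(0)
N_FIRMS, N_APPLICANTS, P = 4, 20, 0.25  # p = 0.25 as in the paper; sizes hypothetical

# Hypothetical per-(firm, applicant) screening scores.
scores = {(f, a): random.random()
          for f in range(N_FIRMS) for a in range(N_APPLICANTS)}

# Each firm interviews its top p fraction of applicants by score.
k = int(P * N_APPLICANTS)
interviews = {
    f: set(sorted(range(N_APPLICANTS), key=lambda a: -scores[(f, a)])[:k])
    for f in range(N_FIRMS)
}

# Each applicant has uniformly random preferences over firms
# (a random permutation, most-preferred first).
prefs = {a: random.sample(range(N_FIRMS), N_FIRMS)
         for a in range(N_APPLICANTS)}

# Capacity 1 on both sides: process applicants in random order; each takes
# their most-preferred firm that interviewed them and is still unfilled.
hired = {}  # firm -> applicant
for a in random.sample(range(N_APPLICANTS), N_APPLICANTS):
    options = [f for f in prefs[a] if a in interviews[f] and f not in hired]
    if options:
        hired[options[0]] = a

print(f"{len(hired)} of {N_FIRMS} firms hired an applicant")
```

The point of such a setup in the paper's context is that when firms' screening scores come from correlated LLMs, the same applicants tend to be interviewed everywhere, which is what the correlation analysis is designed to measure.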