Correlated Errors in Large Language Models
Authors: Elliot Myunghoon Kim, Avi Garg, Kenny Peng, Nikhil Garg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors; on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. |
| Researcher Affiliation | Collaboration | 1Cornell University 2Independent. Correspondence to: Kenny Peng <EMAIL>, Nikhil Garg <EMAIL>. |
| Pseudocode | No | The paper describes its methodology in narrative text and refers to specific prompts in Figure 7, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/nikhgarg/llm_correlated_errors_public/. |
| Open Datasets | Yes | We started from two LLM leaderboards: (1) Hugging Face's Open LLM Leaderboard; and (2) Stanford's Holistic Evaluation of Language Models (HELM) (Liang et al., 2023)... Starting from large datasets of job postings (Asaniczka, 2024) and resumes (Bhawal, 2022; Jiechieu & Tsopze, 2021). |
| Dataset Splits | Yes | We hand-label 450 resume-job pairs (30 unique resumes and 15 job descriptions) using the same criteria as our prompts. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models or memory) used for running its experiments or analysis. It mentions API credits from Meta and Amazon, implying the use of cloud services for accessing LLMs, but no details about the local hardware for their own analysis. |
| Software Dependencies | No | The paper mentions 'Sentence Transformers (SBERT)' in Appendix A.2, but does not provide a specific version number for this or any other software dependency. |
| Experiment Setup | Yes | In our experiments, we set p = 0.25, so the top quarter of applicants receive interviews at each firm... Each firm has capacity of 1: each applicant accepts at most one job offer, and each firm can hire at most one applicant. For all experiments in this section, each applicant a ∈ A has uniformly random preferences over firms. |
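The headline statistic quoted above (models agree 60% of the time when both err) is an agreement rate conditioned on joint error. A minimal sketch of how such a statistic could be computed is below; the function name and the toy predictions are illustrative assumptions, not the paper's actual code or data.

```python
import numpy as np

def agreement_when_both_err(preds_a, preds_b, truth):
    """Fraction of items on which two models give the same answer,
    among items where both models are wrong (hypothetical helper)."""
    preds_a, preds_b, truth = map(np.asarray, (preds_a, preds_b, truth))
    both_err = (preds_a != truth) & (preds_b != truth)
    if not both_err.any():
        return float("nan")  # undefined if the models never jointly err
    return float((preds_a[both_err] == preds_b[both_err]).mean())

# Toy multiple-choice example: both models err on items 1, 3, and 4,
# and give the same wrong answer on two of those three items.
truth   = ["A", "B", "C", "D", "A"]
model_x = ["A", "C", "C", "A", "B"]
model_y = ["A", "C", "D", "A", "C"]
print(agreement_when_both_err(model_x, model_y, truth))  # 2/3
```

Conditioning on joint error isolates correlation beyond shared accuracy: two independent-but-accurate models would rarely agree on *which* wrong answer to give.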