Correlated Errors in Large Language Models
Authors: Elliot Myunghoon Kim, Avi Garg, Kenny Peng, Nikhil Garg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors; on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. |
| Researcher Affiliation | Collaboration | 1Cornell University 2Independent. Correspondence to: Kenny Peng <EMAIL>, Nikhil Garg <EMAIL>. |
| Pseudocode | No | The paper describes its methodology in narrative text and refers to specific prompts in Figure 7, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/nikhgarg/llm_correlated_errors_public/. |
| Open Datasets | Yes | We started from two LLM leaderboards: (1) Hugging Face's Open LLM Leaderboard; and (2) Stanford's Holistic Evaluation of Language Models (HELM) (Liang et al., 2023)... Starting from large datasets of job postings (Asaniczka, 2024) and resumes (Bhawal, 2022; Jiechieu & Tsopze, 2021). |
| Dataset Splits | Yes | We hand-label 450 resume-job pairs (30 unique resumes and 15 job descriptions) using the same criteria as our prompts. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models or memory) used for running its experiments or analysis. It mentions API credits from Meta and Amazon, implying the use of cloud services for accessing LLMs, but no details about the local hardware for their own analysis. |
| Software Dependencies | No | The paper mentions 'Sentence Transformers (SBERT)' in Appendix A.2, but does not provide a specific version number for this or any other software dependency. |
| Experiment Setup | Yes | In our experiments, we set p = 0.25, so the top quarter of applicants receive interviews at each firm... Each firm has capacity of 1: each applicant accepts at most one job offer, and each firm can hire at most one applicant. For all experiments in this section, each applicant a ∈ A has uniformly random preferences over firms. |
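The headline statistic quoted above (models agree 60% of the time when both err) is an agreement rate conditioned on joint error. A minimal sketch of how such a statistic could be computed is below; the function name and the toy predictions are illustrative assumptions, not the paper's actual code or data.

```python
import numpy as np

def agreement_when_both_err(preds_a, preds_b, truth):
    """Fraction of items on which two models give the same answer,
    among items where both models are wrong (hypothetical helper)."""
    preds_a, preds_b, truth = map(np.asarray, (preds_a, preds_b, truth))
    both_err = (preds_a != truth) & (preds_b != truth)
    if not both_err.any():
        return float("nan")  # undefined if the models never jointly err
    return float((preds_a[both_err] == preds_b[both_err]).mean())

# Toy multiple-choice example: both models err on items 1, 3, and 4,
# and give the same wrong answer on two of those three items.
truth   = ["A", "B", "C", "D", "A"]
model_x = ["A", "C", "C", "A", "B"]
model_y = ["A", "C", "D", "A", "C"]
print(agreement_when_both_err(model_x, model_y, truth))  # 2/3
```

Conditioning on joint error isolates correlation beyond shared accuracy: two independent-but-accurate models would rarely agree on *which* wrong answer to give.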