Unearthing Skill-level Insights for Understanding Trade-offs of Foundation Models
Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is 18% more accurate in computing molar mass, but 19% less accurate in applying constitutional law, despite the overall accuracies of the three models differing by a mere 0.4%. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a 3% accuracy improvement over our 12 dataset corpus. |
| Researcher Affiliation | Collaboration | Mazda Moayeri (1,2), Vidhisha Balachandran (1), Varun Chandrasekaran (1,3), Safoora Yousefi (1), Thomas Fel (4), Soheil Feizi (2), Besmira Nushi (1), Neel Joshi (1), Vibhav Vineet (1). 1: Microsoft Research AI Frontiers; 2: University of Maryland; 3: University of Illinois Urbana-Champaign; 4: Harvard University Kempner Institute. EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology in detail in Section 2, 'Constructing Skill-Slices Using Model-Generated Rationales', and provides prompt examples in Appendix C for skill extraction, but it does not include a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | We will release all our rationales, skill annotations, and skill-slices, which we term the Skill-Index, to the public at github.com/microsoft/skill-slice-insights. |
| Open Datasets | Yes | We will release all our rationales, skill annotations, and skill-slices, which we term the Skill-Index, to the public at github.com/microsoft/skill-slice-insights. Datasets. We include datasets from 12 benchmarks in our study, consisting of 11 multimodal (image and text) datasets and 1 language-only dataset. These datasets are: MMLU Pro, a language-only benchmark by Wang et al. (2024), intended to be a harder version of MMLU (Hendrycks et al., 2021); MMMU, a multimodal benchmark with college-level questions from many academic subjects intended to test expert AI, by Yue et al. (2024); Math Vista, a mathematics visual understanding benchmark by Lu et al. (2024); MMC, a chart understanding multimodal benchmark by Liu et al. (2024); MMVP, a benchmark specifically focusing on failure modes of VLMs, by Liang et al. (2024); many general multimodal benchmarks testing numerous abilities: MMBench, by Liu et al. (2023), MMTBench, by Ying et al. (2024), MME, by Fu et al. (2023), MMVet, by Yu et al. (2024b), and SEEDBench, by Li et al. (2023); Realworld-QA, a benchmark that claims to test many real-world visual understanding questions, by xAI (2023); and Vibe Eval, by Padlewski et al. (2024) (also referred to as Reka Vibe, as it was produced by the company Reka). |
| Dataset Splits | Yes | To test this, we route each instance in our corpus to one of GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, based on the skill annotations for that instance and the skill-wise accuracies per model computed over the remaining corpus (i.e. without the test instance). To obtain a single score per instance per model, we take a weighted average of skill-wise accuracies, where the weight for each skill is the inverse of its slice size (so as to upweight finer-grained, more specific skills). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It refers to using strong models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, but does not specify the computational resources on which the authors conducted their analyses and evaluations. |
| Software Dependencies | No | The paper mentions using specific models like "GPT-4o", "SFR-Embedding-2 R model (Meng* et al., 2024)", and "sentence-transformers (Reimers & Gurevych, 2019)". However, it does not provide specific version numbers for software libraries or environments (e.g., Python version, specific sentence-transformers library version) that would be needed for replication. |
| Experiment Setup | No | The paper describes a methodology for evaluating foundation models using skill-slices and rationale parsing, including how prompts are constructed for GPT-4o. However, it does not provide specific hyperparameters or training configurations for any model training (as it evaluates existing, pre-trained foundation models) or detailed system-level settings beyond the prompting strategy. |
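The routing rule quoted under Dataset Splits can be sketched as follows. This is a minimal illustration, not the authors' released code: the function and variable names (`route_instance`, `skill_correct`, `skill_total`) are hypothetical, and it assumes skill-wise accuracies have already been tallied with the test instance held out (leave-one-out).

```python
def route_instance(instance_skills, skill_correct, skill_total, models):
    """Pick the model with the highest weighted skill-wise accuracy.

    skill_correct[model][skill] / skill_total[model][skill] is a model's
    accuracy on the slice of instances annotated with `skill`, computed
    over the remaining corpus (i.e. without the test instance). Each skill
    is weighted by the inverse of its slice size, so finer-grained, more
    specific skills count for more.
    """
    best_model, best_score = None, float("-inf")
    for model in models:
        num, den = 0.0, 0.0
        for skill in instance_skills:
            n = skill_total[model].get(skill, 0)
            if n == 0:
                continue  # no other instances test this skill for this model
            acc = skill_correct[model].get(skill, 0) / n
            w = 1.0 / n  # inverse slice size
            num += w * acc
            den += w
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```

With this weighting, a skill backed by a small (specific) slice dominates one backed by a large (generic) slice: for a skill with slice size 2 versus one with slice size 100, the weights are 0.5 and 0.01 respectively, so the rarer skill contributes roughly 50x more to the routing score.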