Unearthing Skill-level Insights for Understanding Trade-offs of Foundation Models

Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

ICLR 2025

Reproducibility checklist (each entry gives the variable, the result, and the LLM's supporting response):
Research Type: Experimental. After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is 18% more accurate in computing molar mass, but 19% less accurate in applying constitutional law, despite the overall accuracies of the three models differing by a mere 0.4%. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill-slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a 3% accuracy improvement over our 12-dataset corpus.
Researcher Affiliation: Collaboration. Mazda Moayeri (1,2), Vidhisha Balachandran (1), Varun Chandrasekaran (1,3), Safoora Yousefi (1), Thomas Fel (4), Soheil Feizi (2), Besmira Nushi (1), Neel Joshi (1), Vibhav Vineet (1). Affiliations: 1 Microsoft Research AI Frontiers; 2 University of Maryland; 3 University of Illinois Urbana-Champaign; 4 Harvard University Kempner Institute. EMAIL, EMAIL
Pseudocode: No. The paper describes its methodology in detail in Section 2, 'Constructing Skill-Slices Using Model-Generated Rationales', and provides prompt examples in Appendix C for skill extraction, but it does not include a formally structured pseudocode or algorithm block.
Open Source Code: Yes. "We will release all our rationales, skill annotations, and skill-slices, which we term the Skill-Index, to the public at github.com/microsoft/skill-slice-insights."
Open Datasets: Yes. "We will release all our rationales, skill annotations, and skill-slices, which we term the Skill-Index, to the public at github.com/microsoft/skill-slice-insights." Datasets: the study includes 12 benchmarks, consisting of 11 multimodal (image and text) datasets and 1 language-only dataset:
- MMLU Pro, a language-only benchmark by Wang et al. (2024), intended to be a harder version of MMLU (Hendrycks et al., 2021)
- MMMU, a multimodal benchmark with college-level questions from many academic subjects, intended to test expert AI, by Yue et al. (2024)
- MathVista, a mathematical visual understanding benchmark by Lu et al. (2024)
- MMC, a chart-understanding multimodal benchmark by Liu et al. (2024)
- MMVP, a benchmark specifically focusing on failure modes of VLMs, by Liang et al. (2024)
- Several general multimodal benchmarks testing numerous abilities: MMBench, by Liu et al. (2023); MMTBench, by Ying et al. (2024); MME, by Fu et al. (2023); MMVet, by Yu et al. (2024b); SEEDBench, by Li et al. (2023)
- Realworld-QA, a benchmark of real-world visual understanding questions, by xAI (2023)
- Vibe Eval, by Padlewski et al. (2024) (also referred to as Reka Vibe, as it was produced by the company Reka)
Dataset Splits: Yes. To test this, we route each instance in our corpus to one of GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, based on the skill annotations for that instance and the skill-wise accuracies per model computed over the remaining corpus (i.e. without the test instance). To obtain a single score per instance per model, we take a weighted average of skill-wise accuracies, where the weight for each skill is the inverse of its slice size (so as to upweight finer-grained, more specific skills).
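The routing rule above can be sketched in a few lines. This is a minimal illustration, not the paper's released implementation: the function name, skill labels, accuracy numbers, and slice sizes below are all hypothetical.

```python
# Minimal sketch of the skill-based routing rule described above.
# All names, skills, and numbers here are illustrative assumptions.

def route_instance(instance_skills, skill_accuracy, slice_size, models):
    """Route an instance to the model with the highest weighted average
    of skill-wise accuracies, weighting each skill by the inverse of its
    slice size (so finer-grained, more specific skills count more)."""
    best_model, best_score = None, float("-inf")
    for model in models:
        num = den = 0.0
        for skill in instance_skills:
            w = 1.0 / slice_size[skill]  # inverse slice size as weight
            num += w * skill_accuracy[model][skill]
            den += w
        score = num / den if den else 0.0
        if score > best_score:
            best_model, best_score = model, score
    return best_model

# Hypothetical skill-wise accuracies (computed over the corpus with the
# test instance held out) and skill-slice sizes.
skill_acc = {
    "gpt-4o": {"computing molar mass": 0.60, "applying constitutional law": 0.80},
    "gemini-1.5-pro": {"computing molar mass": 0.78, "applying constitutional law": 0.61},
}
sizes = {"computing molar mass": 40, "applying constitutional law": 400}

print(route_instance(["computing molar mass"], skill_acc, sizes, list(skill_acc)))
# prints "gemini-1.5-pro" with these hypothetical numbers
```

With both skills present, the smaller (more specific) "computing molar mass" slice gets a 10x larger weight than the broader "applying constitutional law" slice, which is the upweighting the paper describes.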
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It refers to using strong models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, but does not specify the computational resources on which the authors conducted their analyses and evaluations.
Software Dependencies: No. The paper mentions using specific models like "GPT-4o", "SFR-Embedding-2 R model (Meng* et al., 2024)", and "sentence-transformers (Reimers & Gurevych, 2019)". However, it does not provide specific version numbers for software libraries or environments (e.g., Python version, specific sentence-transformers library version) that would be needed for replication.
Experiment Setup: No. The paper describes a methodology for evaluating foundation models using skill-slices and rationale parsing, including how prompts are constructed for GPT-4o. However, it does not provide specific hyperparameters or training configurations for any model training (as it evaluates existing, pre-trained foundation models) or detailed system-level settings beyond the prompting strategy.