Unearthing Skill-level Insights for Understanding Trade-offs of Foundation Models
Authors: Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of skill-slices (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is 18% more accurate in computing molar mass, but 19% less accurate in applying constitutional law, despite the overall accuracies of the three models differing by a mere 0.4%. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a 3% accuracy improvement over our 12 dataset corpus. |
| Researcher Affiliation | Collaboration | Mazda Moayeri (1,2), Vidhisha Balachandran (1), Varun Chandrasekaran (1,3), Safoora Yousefi (1), Thomas Fel (4), Soheil Feizi (2), Besmira Nushi (1), Neel Joshi (1), Vibhav Vineet (1). 1: Microsoft Research AI Frontiers; 2: University of Maryland; 3: University of Illinois Urbana-Champaign; 4: Harvard University Kempner Institute. EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology in detail in Section 2, 'Constructing Skill-Slices Using Model-Generated Rationales', and provides prompt examples in Appendix C for skill extraction, but it does not include a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | We will release all our rationales, skill annotations, and skill-slices, which we term the Skill-Index, to the public at github.com/microsoft/skill-slice-insights. |
| Open Datasets | Yes | We will release all our rationales, skill annotations, and skill-slices, which we term the Skill-Index, to the public at github.com/microsoft/skill-slice-insights. Datasets. We include datasets from 12 benchmarks in our study, consisting of 11 multimodal (image and text) datasets and 1 language-only dataset. These datasets are: MMLU Pro, a language-only benchmark by Wang et al. (2024), intended to be a harder version of MMLU (Hendrycks et al., 2021); MMMU, a multimodal benchmark with college-level questions from many academic subjects intended to test expert AI, by Yue et al. (2024); Math Vista, a mathematics visual understanding benchmark by Lu et al. (2024); MMC, a chart understanding multimodal benchmark by Liu et al. (2024); MMVP, a benchmark specifically focusing on failure modes of VLMs, by Liang et al. (2024); many general multimodal benchmarks testing numerous abilities: MMBench, by Liu et al. (2023), MMTBench, by Ying et al. (2024), MME, by Fu et al. (2023), MMVet, by Yu et al. (2024b), and SEEDBench, by Li et al. (2023); Realworld-QA, a benchmark that claims to test many real-world visual understanding questions, by xAI (2023); and Vibe Eval, by Padlewski et al. (2024) (also referred to as Reka Vibe, as it was produced by the company Reka). |
| Dataset Splits | Yes | To test this, we route each instance in our corpus to one of GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, based on the skill annotations for that instance and the skill-wise accuracies per model computed over the remaining corpus (i.e. without the test instance). To obtain a single score per instance per model, we take a weighted average of skill-wise accuracies, where the weight for each skill is the inverse of its slice size (so as to upweight finer-grained, more specific skills). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It refers to using strong models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, but does not specify the computational resources on which the authors conducted their analyses and evaluations. |
| Software Dependencies | No | The paper mentions using specific models like "GPT-4o", "SFR-Embedding-2 R model (Meng* et al., 2024)", and "sentence-transformers (Reimers & Gurevych, 2019)". However, it does not provide specific version numbers for software libraries or environments (e.g., Python version, specific sentence-transformers library version) that would be needed for replication. |
| Experiment Setup | No | The paper describes a methodology for evaluating foundation models using skill-slices and rationale parsing, including how prompts are constructed for GPT-4o. However, it does not provide specific hyperparameters or training configurations for any model training (as it evaluates existing, pre-trained foundation models) or detailed system-level settings beyond the prompting strategy. |
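The routing rule quoted under Dataset Splits can be sketched as follows. This is a minimal illustration, not the authors' released code: the function and variable names (`route_instance`, `skill_correct`, `skill_total`) are hypothetical, and it assumes skill-wise accuracies have already been tallied with the test instance held out (leave-one-out).

```python
def route_instance(instance_skills, skill_correct, skill_total, models):
    """Pick the model with the highest weighted skill-wise accuracy.

    skill_correct[model][skill] / skill_total[model][skill] is a model's
    accuracy on the slice of instances annotated with `skill`, computed
    over the remaining corpus (i.e. without the test instance). Each skill
    is weighted by the inverse of its slice size, so finer-grained, more
    specific skills count for more.
    """
    best_model, best_score = None, float("-inf")
    for model in models:
        num, den = 0.0, 0.0
        for skill in instance_skills:
            n = skill_total[model].get(skill, 0)
            if n == 0:
                continue  # no other instances test this skill for this model
            acc = skill_correct[model].get(skill, 0) / n
            w = 1.0 / n  # inverse slice size
            num += w * acc
            den += w
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```

With this weighting, a skill backed by a small (specific) slice dominates one backed by a large (generic) slice: for a skill with slice size 2 versus one with slice size 100, the weights are 0.5 and 0.01 respectively, so the rarer skill contributes roughly 50x more to the routing score.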