Quantifying Generalization Complexity for Large Language Models

Authors: Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Hima Lakkaraju, James R Glass

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold, referred to as critical complexity, where reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs' generalization capabilities. As model size increases, the critical complexity shifts toward higher levels of task complexity, suggesting that larger models can handle more complex reasoning tasks before over-relying on memorization. Leveraging SCYLLA and the concept of critical complexity, we benchmark 28 LLMs, including both open-sourced models such as the LLaMA and Qwen families and closed-sourced models like Claude and GPT, providing a more robust evaluation and establishing a clearer understanding of LLMs' generalization capabilities.
Researcher Affiliation | Collaboration | Zhenting Qi (1), Hongyin Luo (2), Xuliang Huang (3), Zhuokai Zhao (4,5), Yibo Jiang (5), Xiangjun Fan (4), Himabindu Lakkaraju (1), James Glass (2); (1) Harvard University, (2) Massachusetts Institute of Technology, (3) University of Illinois at Urbana-Champaign, (4) Meta, (5) University of Chicago
Pseudocode | No | The paper describes algorithms for various tasks (e.g., Find Minimum, Sort Numbers, Two Sum) in prose within Appendix D but does not provide structured pseudocode or algorithm blocks for the methodology developed in the paper.
Open Source Code | Yes | Source code will be available at https://github.com/zhentingqi/scylla.
Open Datasets | No | SCYLLA disentangles generalization from memorization by assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data across 20 tasks spanning 5 levels of complexity. All data are generated during the evaluation, ensuring that each evaluation instance is unique and unaffected by pre-exposed data.
Dataset Splits | Yes | Finally, 256 test samples are selected for each of the ID and OOD datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments.
Software Dependencies | No | The paper mentions tools and methods such as "WIMBD (What's In My Big Data) (Elazar et al., 2023)" and the "zero-shot chain-of-thought method (Kojima et al., 2022)", but it does not specify versions for any programming languages, libraries, or other software components used in the implementation.
Experiment Setup | Yes | We used the zero-shot chain-of-thought method (Kojima et al., 2022) when prompting these models for solutions. Our choice to focus on zero-shot was deliberate, as adding few-shot examples would introduce a confounding variable (the selection and structure of the examples) that could bias the results and obscure the intrinsic effects of model generalization. Prompt Template 4 (Zero-shot Chain-of-Thought Prompting): "Here is a task: {instruction}. Solve the task with the following input: {input}. IMPORTANT: End your response with 'The answer is <ANSWER>', where you should fill <ANSWER> with your final answer and must format the final answer obeying the following rules: {answer format requirements}. Your response: Let's think step by step." Finally, 256 test samples are selected for each of the ID and OOD datasets.
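The quoted prompt template can be sketched as a small Python helper. The template text follows the paper's Prompt Template 4 quoted above; the function names, argument names, and the regex-based answer extractor are assumptions for illustration, not part of the released SCYLLA code.

```python
import re

# Hypothetical helper reconstructing the zero-shot CoT prompt quoted above
# (Prompt Template 4). Function and argument names are assumptions.
def build_zero_shot_cot_prompt(instruction: str, task_input: str,
                               answer_format: str) -> str:
    return (
        f"Here is a task: {instruction}. "
        f"Solve the task with the following input: {task_input}. "
        "IMPORTANT: End your response with 'The answer is <ANSWER>' "
        "where you should fill <ANSWER> with your final answer and must "
        "format the final answer obeying the following rules: "
        f"{answer_format}. "
        "Your response: Let's think step by step."
    )

# Illustrative extractor for the mandated "The answer is ..." suffix;
# how SCYLLA actually parses responses is not specified here.
def extract_answer(response: str):
    match = re.search(r"The answer is\s*(.+?)\s*$", response, re.DOTALL)
    return match.group(1) if match else None
```

Forcing a fixed answer suffix like this makes automated scoring format-robust, which matters when the same prompt is sent to 28 different models.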
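The Open Datasets and Dataset Splits rows note that all evaluation data are generated on the fly, with 256 samples each for ID and OOD. A purely illustrative sketch of such a generator for one task (Two Sum, mentioned in Appendix D) is below; how SCYLLA actually defines the ID/OOD distinction is not stated here, so using a wider value range for OOD is an assumption for illustration only.

```python
import random

# Illustrative on-the-fly sample generator for a Two Sum-style task.
# The ID/OOD split via value ranges is an ASSUMPTION, not SCYLLA's method.
def generate_two_sum_sample(rng: random.Random, ood: bool = False) -> dict:
    lo, hi = (1000, 100000) if ood else (0, 100)  # assumed ranges
    nums = [rng.randint(lo, hi) for _ in range(8)]
    i, j = rng.sample(range(len(nums)), 2)  # two distinct indices
    target = nums[i] + nums[j]
    return {"input": {"nums": nums, "target": target},
            "answer": sorted([i, j])}

def generate_eval_set(seed: int = 0, n: int = 256, ood: bool = False):
    # Fresh data per evaluation run, so no instance can appear in
    # pretraining corpora; n=256 matches the paper's reported split size.
    rng = random.Random(seed)
    return [generate_two_sum_sample(rng, ood) for _ in range(n)]
```

Generating data at evaluation time, rather than fixing a public test set, is what lets the benchmark rule out contamination from pre-exposed data.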