The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
Authors: Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Wong, Simon See
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. The main experimental results are illustrated in Figure 2 (full results in Appendix E). Across nine ICL benchmarks, LLMs employing direct answering substantially outperform CoT, achieving a relative improvement of 20.42% (absolute 5.10%). |
| Researcher Affiliation | Collaboration | ¹The Hong Kong University of Science and Technology, ²NVIDIA EMAIL, EMAIL |
| Pseudocode | No | The paper describes prompt templates and case studies which include structured text or Python code examples (in Appendix B.1.1, B.2, B.3, B.4), but it does not contain any block explicitly labeled 'Pseudocode' or 'Algorithm' describing the methodology of the paper itself. |
| Open Source Code | Yes | Datasets We conduct experiments on a diverse selection of pattern-based in-context learning datasets spanning multiple modalities1: 1) Symbolic: Pattern-based transformations between symbolic matrices, e.g., ARC-AGI and Mini ARC. 2) Textual: Rule-based translations between natural language and artificial languages, e.g., SCAN and COGS. 3) Numerical: Pattern-based or function-based projections between numerical vectors or matrices, e.g., List Functions and RAVEN. 1https://github.com/HKUST-KnowComp/CoT-ICL-Eval |
| Open Datasets | Yes | Table 1: In-context learning datasets in our experiments (Dataset, # Demos, Modality, Size): ARC-AGI (Chollet, 2019), 2–10, Symbolic, 835; Mini ARC (Kim et al., 2022), 2–8, Symbolic, 149; 1D-ARC (Xu et al., 2024), 3, Symbolic, 901; SCAN (Lake & Baroni, 2018), 5–8, Textual, 1,000; Mini SCAN (Nye et al., 2020), 14, Textual, 1,000; COGS (Kim & Linzen, 2020), 10, Textual, 1,000; SALT (Zheng et al., 2025a), 4, Textual, 1,200; List Function (Rule, 2020), 3, Numerical, 1,250; RAVEN (Zhang et al., 2019), 2, Numerical / Symbolic, 1,259 |
| Dataset Splits | Yes | COGS: The original COGS dataset (Kim & Linzen, 2020) evaluates the compositional generalization of machine learning models through a task that introduces compositional distribution shifts in input-output mappings. In this study, we use the test dataset, sampling 10 entries as in-context demonstrations. |
| Hardware Specification | No | The paper lists various open-source and proprietary Large Language Models (LLMs) and Large Reasoning Models (LRMs) that were evaluated (e.g., Deepseek-V3, Llama-3.1-8B, GPT-4o), but it does not specify the hardware (e.g., GPU models, CPU types, or memory) on which these models were run for the experiments. |
| Software Dependencies | No | The paper refers to using specific LLMs for evaluation (e.g., Qwen-2.5-72B, GPT-4o-mini) and mentions Python functions as ground truth in one dataset, but it does not list specific versions of programming languages, libraries, or frameworks (e.g., Python, PyTorch, or TensorFlow) that were used to conduct the experiments. |
| Experiment Setup | Yes | In our experiment, we tested 20 modern LLM/LRMs. All experiments were run with temperature set to zero. We here provide detailed information on our four tailored experiments to investigate the underlying cause of CoT's ineffectiveness in ICL. The prompt instructions for our dummy rationale experiments are provided below: For a fair comparison of reasoning frameworks against vanilla zero-shot CoT and direct answering, we adopt a one-off prompting approach rather than a complex agent framework. |
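To make the compared conditions concrete, the following is a minimal sketch of how a one-off ICL prompt could be assembled for the direct-answering and zero-shot CoT settings. The function name, prompt wording, and SCAN-style demonstrations are illustrative assumptions, not the paper's exact templates; decoding would additionally use temperature 0 as stated above.

```python
# Hedged sketch: one-off prompting for pattern-based ICL, comparing
# direct answering against zero-shot CoT. Wording is illustrative only.

def build_icl_prompt(demos, query, mode="direct"):
    """Assemble a single prompt from (input, output) demonstration pairs.

    mode="direct": request only the final output.
    mode="cot":    append a zero-shot chain-of-thought trigger first.
    """
    parts = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    parts.append(f"Input: {query}")
    if mode == "cot":
        # Zero-shot CoT trigger: reason step by step before answering.
        parts.append("Let's think step by step, then state the final Output.")
    else:
        # Direct answering: no intermediate reasoning requested.
        parts.append("Output:")
    return "\n\n".join(parts)

# Illustrative SCAN-like demonstrations (not drawn from the paper's data).
demos = [("jump twice", "JUMP JUMP"), ("walk left", "LTURN WALK")]
direct_prompt = build_icl_prompt(demos, "jump left", mode="direct")
cot_prompt = build_icl_prompt(demos, "jump left", mode="cot")
```

Both prompts share the same demonstrations and query; only the final instruction differs, which is what makes the one-off comparison between the two answering modes controlled.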