The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
Authors: Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Wong, Simon See
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. The main experimental results are illustrated in Figure 2 (full results in Appendix E). Across nine ICL benchmarks, LLMs employing direct answering substantially outperform CoT, achieving a relative improvement of 20.42% (absolute 5.10%). |
| Researcher Affiliation | Collaboration | ¹The Hong Kong University of Science and Technology, ²NVIDIA EMAIL, EMAIL |
| Pseudocode | No | The paper describes prompt templates and case studies which include structured text or Python code examples (in Appendix B.1.1, B.2, B.3, B.4), but it does not contain any block explicitly labeled 'Pseudocode' or 'Algorithm' describing the methodology of the paper itself. |
| Open Source Code | Yes | Datasets We conduct experiments on a diverse selection of pattern-based in-context learning datasets spanning multiple modalities1: 1) Symbolic: Pattern-based transformations between symbolic matrices, e.g., ARC-AGI and Mini ARC. 2) Textual: Rule-based translations between natural language and artificial languages, e.g., SCAN and COGS. 3) Numerical: Pattern-based or function-based projections between numerical vectors or matrices, e.g., List Functions and RAVEN. 1https://github.com/HKUST-KnowComp/CoT-ICL-Eval |
| Open Datasets | Yes | Table 1: In-context learning datasets in our experiments (Dataset, # Demos, Modality, Size): ARC-AGI (Chollet, 2019), 2–10, Symbolic, 835; Mini ARC (Kim et al., 2022), 2–8, Symbolic, 149; 1D-ARC (Xu et al., 2024), 3, Symbolic, 901; SCAN (Lake & Baroni, 2018), 5–8, Textual, 1,000; Mini SCAN (Nye et al., 2020), 14, Textual, 1,000; COGS (Kim & Linzen, 2020), 10, Textual, 1,000; SALT (Zheng et al., 2025a), 4, Textual, 1,200; List Function (Rule, 2020), 3, Numerical, 1,250; RAVEN (Zhang et al., 2019), 2, Numerical / Symbolic, 1,259 |
| Dataset Splits | Yes | COGS: The original COGS dataset (Kim & Linzen, 2020) evaluates the compositional generalization of machine learning models through a task that introduces compositional distribution shifts in input-output mappings. In this study, we use the test dataset, sampling 10 entries as in-context demonstrations. |
| Hardware Specification | No | The paper lists various open-source and proprietary Large Language Models (LLMs) and Large Reasoning Models (LRMs) that were evaluated (e.g., Deepseek-V3, Llama-3.1-8B, GPT-4o), but it does not specify the hardware (e.g., GPU models, CPU types, or memory) on which these models were run for the experiments. |
| Software Dependencies | No | The paper refers to using specific LLMs for evaluation (e.g., Qwen-2.5-72B, GPT-4o-mini) and mentions Python functions as ground truth in one dataset, but it does not list specific versions of programming languages, libraries, or frameworks (e.g., Python, PyTorch, or TensorFlow) that were used to conduct the experiments. |
| Experiment Setup | Yes | In our experiment, we tested 20 modern LLM/LRMs. All experiments were run with temperature set to zero. We here provide detailed information on our four tailored experiments to investigate the underlying cause of CoT's ineffectiveness in ICL. The prompt instructions for our dummy rationale experiments are provided below: For a fair comparison of reasoning frameworks against vanilla zero-shot CoT and direct answering, we adopt a one-off prompting approach rather than a complex agent framework. |
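To make the compared conditions concrete, the following is a minimal sketch of how a one-off ICL prompt could be assembled for the direct-answering and zero-shot CoT settings. The function name, prompt wording, and SCAN-style demonstrations are illustrative assumptions, not the paper's exact templates; decoding would additionally use temperature 0 as stated above.

```python
# Hedged sketch: one-off prompting for pattern-based ICL, comparing
# direct answering against zero-shot CoT. Wording is illustrative only.

def build_icl_prompt(demos, query, mode="direct"):
    """Assemble a single prompt from (input, output) demonstration pairs.

    mode="direct": request only the final output.
    mode="cot":    append a zero-shot chain-of-thought trigger first.
    """
    parts = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    parts.append(f"Input: {query}")
    if mode == "cot":
        # Zero-shot CoT trigger: reason step by step before answering.
        parts.append("Let's think step by step, then state the final Output.")
    else:
        # Direct answering: no intermediate reasoning requested.
        parts.append("Output:")
    return "\n\n".join(parts)

# Illustrative SCAN-like demonstrations (not drawn from the paper's data).
demos = [("jump twice", "JUMP JUMP"), ("walk left", "LTURN WALK")]
direct_prompt = build_icl_prompt(demos, "jump left", mode="direct")
cot_prompt = build_icl_prompt(demos, "jump left", mode="cot")
```

Both prompts share the same demonstrations and query; only the final instruction differs, which is what makes the one-off comparison between the two answering modes controlled.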