Eliciting Causal Abilities in Large Language Models for Reasoning Tasks
Authors: Yajing Wang, Zongwei Luo, Jingzhe Wang, Zhanke Zhou, Yongqiang Chen, Bo Han
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method effectively generates instructions that enhance reasoning performance with reduced training cost of prompts, leveraging interpretable textual features to provide actionable insights. |
| Researcher Affiliation | Academia | Yajing Wang1,2, Zongwei Luo3,4*, Jingzhe Wang1, Zhanke Zhou2, Yongqiang Chen5, Bo Han2 1 Department of Computer Science, BNU-HKBU United International College 2 TMLR Group, Hong Kong Baptist University 3 Artificial Intelligence and Future Networks (IAIFN), Beijing Normal University at Zhuhai 4 Guangdong Provincial Key Laboratory of IRADS 5 The Chinese University of Hong Kong |
| Pseudocode | No | The paper includes Figure 2 which illustrates the overall process of the method, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/dsubuntu/SCIE |
| Open Datasets | Yes | We evaluate ten common datasets across four categories of reasoning tasks for the experiment. (1) Arithmetic reasoning: GSM8K (Cobbe et al. 2021) and MultiArith (Roy and Roth 2015). (2) Commonsense reasoning: StrategyQA (Geva et al. 2021) and CommonsenseQA (Talmor et al. 2019). (3) Symbolic reasoning: Coin Flip (Wei et al. 2022), Last Letter Concatenation (Wei et al. 2022) and Boolean Expressions (Suzgun et al. 2023). (4) Other logical reasoning: Causal Judgement, Date Understanding, and Disambiguation QA from BIG-Bench Hard (BBH) (Suzgun et al. 2023). |
| Dataset Splits | Yes | For datasets like GSM8K, where the training and test sets are pre-defined, we perform random sampling on the training set and evaluate using the test set. For datasets without predefined training and test sets, we exclude the sampled data used in the SCIE process during testing on the reasoning tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'open interpreter (Open Interpreter 2024)' for ATE estimation and various LLM models like 'GPT-4o mini' and 'GPT-3.5 Turbo', but does not provide specific version numbers for general ancillary software dependencies (e.g., programming languages, libraries, frameworks) used in their experimental setup. |
| Experiment Setup | Yes | We use Zero-Shot CoT as the base instruction and generated high-quality observational data within SCIE, where a = 9, b = 5, and n = 8, which is automatically generated by GPT-4o mini (using this setting in the following experiments if not specified). Due to the lengthy instructions of AgentInstruct, we set a = 5, b = 5 for cost control. |
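The split handling quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not code from the SCIE repository; `make_eval_split` and its parameters are hypothetical names. For datasets with a predefined train/test split (e.g., GSM8K), instruction-generation data is sampled from the training set and evaluation uses the test set; otherwise, the sampled items are excluded from the evaluation pool.

```python
import random

def make_eval_split(pool, sample_size, test_set=None, seed=0):
    """Hypothetical helper mirroring the paper's described split handling.

    pool:        training set (if test_set is given) or the full dataset
    sample_size: number of examples drawn for the SCIE process
    test_set:    predefined test set, if the dataset has one
    """
    rng = random.Random(seed)
    indices = list(range(len(pool)))
    sampled_idx = set(rng.sample(indices, sample_size))
    sampled = [pool[i] for i in sampled_idx]
    if test_set is not None:
        # Predefined split: evaluate on the untouched test set.
        eval_set = test_set
    else:
        # No predefined split: exclude sampled items from evaluation.
        eval_set = [pool[i] for i in indices if i not in sampled_idx]
    return sampled, eval_set
```

For example, with a 10-item pool and no predefined test set, sampling 3 items leaves 7 disjoint items for evaluation.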