Eliciting Causal Abilities in Large Language Models for Reasoning Tasks
Authors: Yajing Wang, Zongwei Luo, Jingzhe Wang, Zhanke Zhou, Yongqiang Chen, Bo Han
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method effectively generates instructions that enhance reasoning performance with reduced training cost of prompts, leveraging interpretable textual features to provide actionable insights. |
| Researcher Affiliation | Academia | Yajing Wang1,2, Zongwei Luo3,4*, Jingzhe Wang1, Zhanke Zhou2, Yongqiang Chen5, Bo Han2 1 Department of Computer Science, BNU-HKBU United International College 2 TMLR Group, Hong Kong Baptist University 3 Artificial Intelligence and Future Networks (IAIFN), Beijing Normal University at Zhuhai 4 Guangdong Provincial Key Laboratory of IRADS 5 The Chinese University of Hong Kong |
| Pseudocode | No | The paper includes Figure 2 which illustrates the overall process of the method, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/dsubuntu/SCIE |
| Open Datasets | Yes | We evaluate ten common datasets across four categories of reasoning tasks for the experiment. (1) Arithmetic reasoning: GSM8K (Cobbe et al. 2021) and MultiArith (Roy and Roth 2015). (2) Commonsense reasoning: StrategyQA (Geva et al. 2021) and CommonsenseQA (Talmor et al. 2019). (3) Symbolic reasoning: Coin Flip (Wei et al. 2022), Last Letter Concatenation (Wei et al. 2022) and Boolean Expressions (Suzgun et al. 2023). (4) Other logical reasoning: Causal Judgement, Date Understanding, and Disambiguation QA from BIG-Bench Hard (BBH) (Suzgun et al. 2023). |
| Dataset Splits | Yes | For datasets like GSM8K, where the training and test sets are pre-defined, we perform random sampling on the training set and evaluate using the test set. For datasets without predefined training and test sets, we exclude the sampled data used in the SCIE process during testing on the reasoning tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'open interpreter (Open Interpreter 2024)' for ATE estimation and various LLM models like 'GPT-4o mini' and 'GPT-3.5 Turbo', but does not provide specific version numbers for general ancillary software dependencies (e.g., programming languages, libraries, frameworks) used in their experimental setup. |
| Experiment Setup | Yes | We use Zero-Shot CoT as the base instruction and generated high-quality observational data within SCIE, where a = 9, b = 5, and n = 8, which is automatically generated by GPT-4o mini (using this setting in the following experiments if not specified). Due to the lengthy instructions of AgentInstruct, we set a = 5, b = 5 for cost control. |
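The split handling quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not code from the SCIE repository; `make_eval_split` and its parameters are hypothetical names. For datasets with a predefined train/test split (e.g., GSM8K), instruction-generation data is sampled from the training set and evaluation uses the test set; otherwise, the sampled items are excluded from the evaluation pool.

```python
import random

def make_eval_split(pool, sample_size, test_set=None, seed=0):
    """Hypothetical helper mirroring the paper's described split handling.

    pool:        training set (if test_set is given) or the full dataset
    sample_size: number of examples drawn for the SCIE process
    test_set:    predefined test set, if the dataset has one
    """
    rng = random.Random(seed)
    indices = list(range(len(pool)))
    sampled_idx = set(rng.sample(indices, sample_size))
    sampled = [pool[i] for i in sampled_idx]
    if test_set is not None:
        # Predefined split: evaluate on the untouched test set.
        eval_set = test_set
    else:
        # No predefined split: exclude sampled items from evaluation.
        eval_set = [pool[i] for i in indices if i not in sampled_idx]
    return sampled, eval_set
```

For example, with a 10-item pool and no predefined test set, sampling 3 items leaves 7 disjoint items for evaluation.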