CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning

Authors: Yuanheng Fang, Guoqing Chao, Wenqiang Lei, Shaobo Li, Dianhui Chu

AAAI 2025

Reproducibility Assessment
Research Type: Experimental. LLM Response: "Experiment results on six datasets show the superiority of our proposed method CDW-CoT over the state-of-the-art methods. The main contributions of our work are summarized as follows: Our empirical evaluations confirm that the CDW-CoT framework substantially outperforms traditional CoT methods, achieving state-of-the-art accuracy across multiple datasets."
Researcher Affiliation: Academia. LLM Response: "1Harbin Institute of Technology, Weihai, 264209, Shandong, China; 2Sichuan University, Chengdu, 610065, Sichuan, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL"
Pseudocode: Yes. LLM Response: "Algorithm 1: Cluster-Based Prompt Candidate Pool Initialization; Algorithm 2: Distance-Weighted Prompt Selection and Inference"
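To make Algorithm 2's idea concrete, here is a minimal sketch of distance-weighted prompt selection. It assumes (the paper's exact formulation is not given in this summary) that each cluster holds a probability distribution over the shared prompt pool, that cluster weights come from a softmax over negative distances between the question embedding and cluster centroids, and that prompts are sampled from the resulting mixture. The function name, the softmax weighting, and the use of NumPy are all illustrative assumptions.

```python
import numpy as np

def distance_weighted_selection(q_emb, centroids, cluster_prompt_dists,
                                k=8, temperature=0.3, seed=0):
    """Hypothetical sketch: mix per-cluster prompt distributions, weighting
    each cluster by a softmax over the negative distance from the question
    embedding to that cluster's centroid, then sample k distinct prompts."""
    d = np.linalg.norm(centroids - q_emb, axis=1)   # distance to each centroid
    w = np.exp(-d / temperature)
    w /= w.sum()                                    # cluster weights sum to 1
    mixed = w @ cluster_prompt_dists                # weighted mix of prompt probs
    mixed /= mixed.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(mixed), size=k, replace=False, p=mixed)

# Toy example: 2 clusters, a pool of 40 prompts with uniform per-cluster
# distributions; the query sits near the first centroid.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
cluster_dists = np.full((2, 40), 1.0 / 40)
chosen = distance_weighted_selection(np.array([0.1, 0.1]),
                                     centroids, cluster_dists, k=5)
```

Nearby clusters dominate the mixture, so a question close to one centroid is answered mostly with that cluster's preferred prompts, while distant clusters still contribute a small probability mass.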
Open Source Code: No. LLM Response: "The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it provide any links to a code repository."
Open Datasets: Yes. LLM Response: "Commonsense Reasoning: CommonsenseQA (CSQA) (Talmor et al. 2018): a widely used dataset for evaluating commonsense reasoning through multiple-choice questions that require inference based on prior knowledge and context. StrategyQA (Geva et al. 2021): contains questions requiring implicit multi-hop reasoning to derive yes/no answers, testing the model's ability to connect various pieces of information logically. Symbolic Reasoning: Letter (Wei et al. 2022): involves tasks such as last-letter concatenation, designed to test the symbolic reasoning capabilities of models. Coin (Wei et al. 2022): focuses on determining the state of a coin after a series of flips, evaluating the model's ability to track state changes through symbolic manipulation. Mathematical Reasoning: MultiArith (Roy and Roth 2016): consists of multi-step arithmetic word problems that require a sequence of operations to reach the solution, testing multi-step reasoning in arithmetic contexts. AQuA (Ling et al. 2017): includes complex arithmetic word problems with multiple-choice answers, providing a benchmark for evaluating sophisticated reasoning and calculation skills."
Dataset Splits: Yes. LLM Response: "Datasets were divided into training, evaluation, and test subsets with proportions of approximately 60%, 25%, and 15%, respectively (Wang et al. 2022b). After dividing the data, we identified the number of clusters according to the Auto-CoT setup, and then adjusted the number of clusters for certain datasets from the default 8 to 3, as shown in Table 2. Table 2: Data Split and Number of Clusters Statistics."
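The reported 60%/25%/15% split can be reproduced with a few lines of standard-library Python; the function name and the fixed shuffle seed here are illustrative, not from the paper.

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle and split examples into ~60% train, ~25% eval, ~15% test,
    matching the proportions reported for CDW-CoT's experiments."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.60)
    n_eval = int(n * 0.25)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_eval],
            shuffled[n_train + n_eval:])

train, eval_set, test = split_dataset(list(range(1000)))
```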
Hardware Specification: Yes. LLM Response: "We conducted comparative experiments using both the LLaMA2 (13B) and LLaMA3 (8B) models, running locally on two NVIDIA 4090 GPUs."
Software Dependencies: No. LLM Response: "The paper mentions using LLaMA2 (13B) and LLaMA3 (8B) models but does not provide specific version numbers for other software dependencies such as programming languages, libraries, or frameworks used in the implementation."
Experiment Setup: Yes. LLM Response: "Pool Size: we maintained a consistent pool of 40 potential prompts for each dataset to enable thorough exploration of diverse reasoning pathways. Sample Size: during training, each instance was tested against five unique prompt combinations, assessing the effectiveness of various configurations. Temperature: a temperature of 0.3 was used to optimize prompt selection during testing."
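The pool-size and sample-size settings above can be sketched as a small sampling loop. Only the pool size (40) and combinations per instance (5) come from the text; the number of exemplar prompts per combination is an assumed placeholder, as is the function name.

```python
import random

POOL_SIZE = 40            # candidate prompt pool per dataset (from the paper)
COMBOS_PER_INSTANCE = 5   # prompt combinations tried per training instance
PROMPTS_PER_COMBO = 4     # exemplars per combination: assumed, not stated here

def sample_prompt_combinations(pool, rng):
    """Draw COMBOS_PER_INSTANCE combinations of distinct prompts from the pool,
    one batch per training instance."""
    return [rng.sample(pool, PROMPTS_PER_COMBO)
            for _ in range(COMBOS_PER_INSTANCE)]

pool = [f"prompt_{i}" for i in range(POOL_SIZE)]
combos = sample_prompt_combinations(pool, random.Random(42))
```

Each training instance would then be evaluated under each of the five sampled combinations, and the per-cluster prompt distribution updated toward the combinations that answered correctly.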