C3oT: Generating Shorter Chain-of-Thought Without Compromising Effectiveness

Authors: Yu Kang, Xianghui Sun, Liangyu Chen, Wei Zou

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments over four datasets from arithmetic and commonsense scenarios, showing that the proposed method is capable of compressing the length of generated CoT by up to more than 50% without compromising its effectiveness. Additionally, we design extensive experiments and discussions to analyze the contribution of different components in our approach, as well as to explore future research directions of CoT compression based on our method.
Researcher Affiliation | Industry | Yu Kang, Xianghui Sun, Liangyu Chen*, Wei Zou, Beike Inc., Beijing, China EMAIL
Pseudocode | No | The paper describes the C3oT framework and its components (Compressor, Conditioned Training, Conditioned Inference) in narrative form, supplemented by a diagram in Figure 1, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | For math reasoning, we use GSM8K (Cobbe et al. 2021) and MathQA (Amini et al. 2019). As for commonsense reasoning, we use ECQA (Aggarwal et al. 2021) and StrategyQA (Geva et al. 2021).
Dataset Splits | Yes | We followed the training and testing set division as outlined in the original paper of each dataset used, trained C3oT on the training set, and evaluated its performance on the test set, excluding StrategyQA. Because the ground truths for the StrategyQA test set are inaccessible, we instead further split the original StrategyQA training set into training and test sets.
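The StrategyQA re-split described above can be sketched as a simple seeded shuffle-and-slice. The 10% test fraction, the seed, and the function name are illustrative assumptions; the paper does not report the exact ratio or seed it used.

```python
import random

def split_strategyqa(examples, test_fraction=0.1, seed=0):
    """Re-split the original StrategyQA training set into new train/test
    sets, since ground truths for the official test set are not public.

    NOTE: test_fraction and seed are assumptions for illustration only;
    the paper does not specify these values.
    """
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Usage with dummy examples standing in for StrategyQA records.
data = [{"question": f"q{i}", "answer": i % 2 == 0} for i in range(100)]
train, test = split_strategyqa(data)
print(len(train), len(test))  # 90 10
```

Fixing the seed keeps the split reproducible across runs, which matters when the held-out set doubles as the only evaluation set.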
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer and LLaMA-2-Chat models, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | In this paper, we train C3oT based on LLaMA-2-Chat-7B and -13B (Touvron et al. 2023). We fine-tune the model for 2 epochs on each dataset using the AdamW optimizer with a sequence length of 2,048 tokens and a batch size of 128. The AdamW optimizer's hyperparameters are set as follows: β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁶, and weight decay of 0.001. We employ a cosine learning rate schedule with a maximum learning rate of 1 × 10⁻⁵.
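The reported setup can be collected into a single configuration, with the cosine schedule written out explicitly. The paper states a cosine schedule with a maximum learning rate of 1e-5 but not the warmup or floor behavior, so the zero-floor, no-warmup form below is an assumption; the `config` dict only restates values quoted above.

```python
import math

# Hyperparameters as reported in the paper's experiment setup section.
config = {
    "base_models": ["LLaMA-2-Chat-7B", "LLaMA-2-Chat-13B"],
    "epochs": 2,
    "seq_len": 2048,
    "batch_size": 128,
    "adamw": {"betas": (0.9, 0.999), "eps": 1e-6, "weight_decay": 0.001},
    "max_lr": 1e-5,
    "lr_schedule": "cosine",
}

def cosine_lr(step, total_steps, max_lr):
    """Cosine decay from max_lr at step 0 down to 0 at total_steps.

    ASSUMPTION: the paper does not describe warmup or a learning-rate
    floor, so this sketch uses neither.
    """
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * step / total_steps))

# The schedule starts at the maximum rate and decays toward zero.
print(cosine_lr(0, 1000, config["max_lr"]))    # 1e-05
print(cosine_lr(500, 1000, config["max_lr"]))  # ~5e-06, the halfway point
```

In practice this curve would be evaluated once per optimizer step, with `total_steps` derived from the dataset size, the batch size of 128, and the 2 training epochs.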