UniCoTT: A Unified Framework for Structural Chain-of-Thought Distillation
Authors: Xianwei Zhuang, Zhihong Zhu, Zhichang Wang, Xuxin Cheng, Yuexian Zou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple datasets of factual reasoning, multi-choice question answering and NLU tasks demonstrate the effectiveness and universality of UniCoTT. (Section 4 Experiments; Section 4.3 Ablation Study and Analysis) |
| Researcher Affiliation | Academia | Xianwei Zhuang, Zhihong Zhu, Zhichang Wang, Xuxin Cheng, Yuexian Zou. School of Electronic and Computer Engineering, Peking University |
| Pseudocode | Yes | To demonstrate the construction process of our UniCoT more clearly, we provide algorithmic pseudocode for uniformly constructing different structures of CoT, namely UniCoT, in Alg. 1 and Code 1. (Algorithm 1: Algorithm for Iteratively Constructing UniCoT; Code 1: Pseudocode for Iteratively Constructing UniCoT) |
| Open Source Code | Yes | Our code is available at https://github.com/mengchuang123/UniCoTT. |
| Open Datasets | Yes | Datasets. We conduct experiments across three types of tasks: (1) Factual Reasoning. We evaluate our UniCoTT on the CREAK (Onoe et al., 2021), StrategyQA (Geva et al., 2021) and CSQA2 (Talmor et al., 2021) datasets... (2) Multiple-Choice Question Answering. We select the CSQA (Talmor et al., 2018), QASC (Khot et al., 2020), and OBQA (Mihaylov et al., 2018) datasets... (3) Natural Language Understanding (NLU). In the realm of NLU, we utilized the CoLA (Warstadt et al., 2019), RTE (Poliak, 2020), MNLI (Williams et al., 2018), and MRPC (Dolan & Brockett, 2005) datasets from the GLUE benchmark (Wang et al., 2018)... |
| Dataset Splits | No | The paper does not explicitly provide the training/validation/test splits (e.g., percentages or exact counts) used for the main experiments. It mentions using a 10% random sample of the CREAK dataset for an ablation study, but not for the primary results. While it refers to 'previous work settings' for evaluation metrics, this does not explicitly define the splits needed for reproducibility. |
| Hardware Specification | Yes | All experiments using the encoder-only models are conducted on 8 RTX 3090 GPUs. While our primary experiments focus on encoder-based models, we also extend UniCoTT to decoder-only architectures to validate its generalizability. The experiments using the decoder-only model (i.e., Qwen2.5-3B-Instruct) are conducted on A100-80G GPUs. |
| Software Dependencies | No | The paper mentions using specific LLMs like gpt-3.5-turbo-1106 and Qwen2.5-3B-Instruct, and base models like RoBERTa, BERT, and XLNet. It also states: "We implement all methods based on Huggingface Transformers (Wolf et al., 2020)." and "We utilized the LLaMA-factory framework to implement and train our method." However, it does not provide specific version numbers for software dependencies such as Python, PyTorch, the Huggingface Transformers library itself, or LLaMA-factory. |
| Experiment Setup | Yes | The hyperparameters α and β in Eq. 11 are set to 0.5 and 0.2, respectively, to achieve optimal performance in experiments. The hidden size for text is set to 768. We employ Adam as the optimizer with a weight decay of 0.01. We tune all models for 6 epochs with a learning rate of 3e-6 on all datasets. For CSQA and CSQA2, the batch sizes are set to 5 and 2, respectively; on all other datasets the batch size is uniformly set to 8. |
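The Experiment Setup row scatters several per-dataset values; collecting them in one place makes reruns easier to script. The sketch below is a minimal, hypothetical configuration object: the class and method names are ours (not from the paper's released code), while all numeric values are the ones reported in the table above.

```python
# Hypothetical config sketch for UniCoTT training (names assumed);
# all values below are taken from the paper's reported setup.
from dataclasses import dataclass, field


@dataclass
class UniCoTTTrainConfig:
    alpha: float = 0.5            # weight α in Eq. 11 (reported)
    beta: float = 0.2             # weight β in Eq. 11 (reported)
    hidden_size: int = 768        # text hidden size (reported)
    optimizer: str = "adam"       # Adam with weight decay (reported)
    weight_decay: float = 0.01    # reported
    epochs: int = 6               # reported
    learning_rate: float = 3e-6   # reported, same for all datasets
    default_batch_size: int = 8   # all datasets except the overrides below
    batch_size_overrides: dict = field(
        default_factory=lambda: {"CSQA": 5, "CSQA2": 2}
    )

    def batch_size(self, dataset: str) -> int:
        """Return the reported batch size for a given dataset name."""
        return self.batch_size_overrides.get(dataset, self.default_batch_size)


cfg = UniCoTTTrainConfig()
print(cfg.batch_size("CSQA"))   # 5
print(cfg.batch_size("CREAK"))  # 8
```

Keeping the two dataset-specific batch sizes in an override map, rather than hard-coding branches, mirrors the paper's phrasing ("uniformly set the batch size for training on other datasets to 8") and keeps a single source of truth for the defaults.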