UniCoTT: A Unified Framework for Structural Chain-of-Thought Distillation
Authors: Xianwei Zhuang, Zhihong Zhu, Zhichang Wang, Xuxin Cheng, Yuexian Zou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple datasets of factual reasoning, multi-choice question answering and NLU tasks demonstrate the effectiveness and universality of UniCoTT. (Section 4 Experiments; Section 4.3 Ablation Study and Analysis) |
| Researcher Affiliation | Academia | Xianwei Zhuang, Zhihong Zhu, Zhichang Wang, Xuxin Cheng, Yuexian Zou. School of Electronic and Computer Engineering, Peking University |
| Pseudocode | Yes | To demonstrate the construction process of our UniCoT more clearly, we provide algorithmic pseudocode for uniformly constructing different structures of CoT, namely UniCoT, in Alg. 1 and Code 1. (Algorithm 1: Algorithm for Iteratively Constructing UniCoT; Code 1: Pseudocode for Iteratively Constructing UniCoT) |
| Open Source Code | Yes | Our code is available at https://github.com/mengchuang123/UniCoTT. |
| Open Datasets | Yes | Datasets. We conduct experiments across three types of tasks: (1) Factual Reasoning. We evaluate our UniCoTT on the CREAK (Onoe et al., 2021), StrategyQA (Geva et al., 2021) and CSQA2 (Talmor et al., 2021) datasets... (2) Multiple-Choice Question Answering. We select the CSQA (Talmor et al., 2018), QASC (Khot et al., 2020), and OBQA (Mihaylov et al., 2018) datasets... (3) Natural Language Understanding (NLU). In the realm of NLU, we utilized the CoLA (Warstadt et al., 2019), RTE (Poliak, 2020), MNLI (Williams et al., 2018), and MRPC (Dolan & Brockett, 2005) datasets from the GLUE benchmark (Wang et al., 2018)... |
| Dataset Splits | No | The paper does not explicitly provide the training/validation/test splits (e.g., percentages or exact counts) used for the main experiments. It mentions using a 10% random sample of the CREAK dataset for an ablation study, but not for the primary results. While it refers to 'previous work settings' for evaluation metrics, this does not explicitly define the splits needed for reproducibility. |
| Hardware Specification | Yes | All experiments using the encoder-only models are conducted on 8 RTX 3090 GPUs. While our primary experiments focus on encoder-based models, we also extend UniCoTT to decoder-only architectures to validate its generalizability. The experiments using the decoder-only model (i.e., Qwen2.5-3B-Instruct) are conducted on A100-80G GPUs. |
| Software Dependencies | No | The paper mentions using specific LLMs like gpt-3.5-turbo-1106 and Qwen2.5-3B-Instruct, and base models like RoBERTa, BERT, and XLNet. It also states: "We implement all methods based on Huggingface Transformers (Wolf et al., 2020)." and "We utilized the LLaMA-factory framework to implement and train our method." However, it does not provide specific version numbers for software dependencies such as Python, PyTorch, the Huggingface Transformers library itself, or LLaMA-factory. |
| Experiment Setup | Yes | The hyperparameters α and β in Eq. 11 are set to 0.5 and 0.2, respectively, to achieve optimal performance in experiments. The hidden size for text is set to 768. We employ Adam as the optimizer with a weight decay of 0.01. We tune all models for 6 epochs with a learning rate of 3e-6 on all datasets. For CSQA and CSQA2, the batch sizes are set to 5 and 2, respectively; on all other datasets the batch size is uniformly set to 8. |
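The Experiment Setup row scatters several per-dataset values; collecting them in one place makes reruns easier to script. The sketch below is a minimal, hypothetical configuration object: the class and method names are ours (not from the paper's released code), while all numeric values are the ones reported in the table above.

```python
# Hypothetical config sketch for UniCoTT training (names assumed);
# all values below are taken from the paper's reported setup.
from dataclasses import dataclass, field


@dataclass
class UniCoTTTrainConfig:
    alpha: float = 0.5            # weight α in Eq. 11 (reported)
    beta: float = 0.2             # weight β in Eq. 11 (reported)
    hidden_size: int = 768        # text hidden size (reported)
    optimizer: str = "adam"       # Adam with weight decay (reported)
    weight_decay: float = 0.01    # reported
    epochs: int = 6               # reported
    learning_rate: float = 3e-6   # reported, same for all datasets
    default_batch_size: int = 8   # all datasets except the overrides below
    batch_size_overrides: dict = field(
        default_factory=lambda: {"CSQA": 5, "CSQA2": 2}
    )

    def batch_size(self, dataset: str) -> int:
        """Return the reported batch size for a given dataset name."""
        return self.batch_size_overrides.get(dataset, self.default_batch_size)


cfg = UniCoTTTrainConfig()
print(cfg.batch_size("CSQA"))   # 5
print(cfg.batch_size("CREAK"))  # 8
```

Keeping the two dataset-specific batch sizes in an override map, rather than hard-coding branches, mirrors the paper's phrasing ("uniformly set the batch size for training on other datasets to 8") and keeps a single source of truth for the defaults.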