Learning Harmonized Representations for Speculative Sampling

Authors: Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on four LLaMA models demonstrate that HASS achieves a 2.81x-4.05x wall-clock time speedup ratio averaged across three datasets, surpassing EAGLE-2 by 8%-20%. The code is available at https://github.com/HArmonizedSS/HASS. We conduct experiments across dialogue, code generation, and mathematical reasoning tasks using the MT-bench, HumanEval, and GSM8K datasets, respectively.
Researcher Affiliation | Industry | Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu; Xiaohongshu Inc., Shanghai, China
Pseudocode | Yes | A.1 IMPLEMENTATION OF HARMONIZED CONTEXT ALIGNMENT: We present the pseudo code of harmonized context alignment, implemented without the customized attention mask, for better understanding. The actual implementation in our experiments is achieved by the customized attention mask as shown in Figure 3. The appendix sketches `def train_batch(...)` and `def attention(...)`.
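The `train_batch` pseudocode in the paper's appendix is not reproduced here, so the following is only a minimal, hypothetical sketch of the idea behind harmonized context alignment: the draft model is unrolled for several aligning steps, and after the first step it conditions on its own (detached) predicted features rather than the target model's, mirroring test-time drafting. The class/function names, the tiny one-layer draft model, and the smooth-L1 regression loss are assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDraft(nn.Module):
    """Stand-in for an EAGLE-style one-layer draft model (hypothetical)."""
    def __init__(self, d):
        super().__init__()
        self.fc = nn.Linear(2 * d, d)

    def forward(self, feats, embeds):
        # Draft model predicts next-step features from context features + token embeddings.
        return self.fc(torch.cat([feats, embeds], dim=-1))

def train_batch(draft, target_feats, embeds, align_steps=3):
    """Unroll the draft model for `align_steps` steps.

    Step 1 conditions on the target model's features; later steps condition on
    the draft model's own detached predictions, aligning the training-time
    context with the test-time drafting context.
    """
    feats, losses = target_feats, []
    for _ in range(align_steps):
        pred = draft(feats, embeds)
        losses.append(F.smooth_l1_loss(pred, target_feats))
        feats = pred.detach()  # next step sees the draft's own features
    return torch.stack(losses).mean()
```

The `align_steps=3` default matches the number of aligning steps reported in the paper's experiment setup.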
Open Source Code | Yes | The code is available at https://github.com/HArmonizedSS/HASS.
Open Datasets | Yes | For multi-turn conversation, code generation, and mathematical reasoning tasks, we choose the MT-bench (Zheng et al., 2024), HumanEval (Chen et al., 2021), and GSM8K (Cobbe et al., 2021) datasets, respectively. We keep other settings, such as the fixed training dataset, i.e., the ShareGPT dataset with 68,000 dialogues... (https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered)
Dataset Splits | No | The paper evaluates on the MT-bench, HumanEval, and GSM8K datasets but does not provide specific train/test/validation splits. It mentions using varying proportions (1/8, 1/4, 1/2, 1/1) of the ShareGPT dataset for training, but this concerns the quantity of training data rather than defining dataset splits for evaluation.
Hardware Specification | Yes | All inference processes are conducted on an NVIDIA H800 GPU. ... we train draft models for LLaMA2-Chat 7/13B and LLaMA3-Instruct 8/70B on a single NVIDIA H800 GPU with batch size set to 2 and varied aligning steps.
Software Dependencies | No | Our code is built based on EAGLE-2's open-source repository. The paper does not specify versions for software dependencies such as programming languages or libraries.
Experiment Setup | Yes | The batch size is set to 1 in all experiments... For harmonized objective distillation, K is set to 10, and the distillation loss is added to EAGLE's original loss with a coefficient of w = 1.0. For harmonized context alignment, the draft model is aligned for 3 steps during training. For the dynamic tree structure, the total number of draft tokens is set to 60 for all experiments, with a draft tree depth of 6. Other settings, such as the fixed training dataset (the ShareGPT dataset with 68,000 dialogues) and the optimizer, are kept consistent with EAGLE-2. Tables 1 and 2 also list results for 'Temperature = 0' and 'Temperature = 1'.
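For convenience, the reported hyperparameters above can be collected into a single configuration. This is a hypothetical config dict (the key names are ours, not from the paper or its codebase); only the values come from the quoted setup.

```python
# Hypothetical configuration gathering the hyperparameters reported in the paper.
hass_config = {
    "inference_batch_size": 1,   # batch size under all experiments
    "top_k_distill": 10,         # K for harmonized objective distillation
    "distill_loss_weight": 1.0,  # w, coefficient on the distillation loss
    "align_steps": 3,            # harmonized context alignment steps
    "num_draft_tokens": 60,      # total draft tokens in the dynamic tree
    "draft_tree_depth": 6,       # depth of the draft tree
}
```

Training-data size (the ShareGPT dataset with 68,000 dialogues) and the optimizer follow EAGLE-2 and are therefore not restated here.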