Learning Harmonized Representations for Speculative Sampling

Authors: Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on four LLaMA models demonstrate that HASS achieves a 2.81x-4.05x wall-clock time speedup ratio averaged across three datasets, surpassing EAGLE-2 by 8%-20%. The code is available at https://github.com/HArmonizedSS/HASS. We conduct experiments across dialogue, code generation, and mathematical reasoning tasks using the MT-bench, HumanEval, and GSM8K datasets, respectively.
Researcher Affiliation | Industry | Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu; Xiaohongshu Inc., Shanghai, China
Pseudocode | Yes | A.1 IMPLEMENTATION OF HARMONIZED CONTEXT ALIGNMENT: We present the pseudo code of harmonized context alignment, implemented without the customized attention mask, for better understanding. The actual implementation in our experiments is achieved by the customized attention mask as shown in Figure 3. The appendix sketches `def train_batch(...)` and `def attention(...)`.
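The `train_batch` pseudocode in the paper's appendix is not reproduced here, so the following is only a minimal, hypothetical sketch of the idea behind harmonized context alignment: the draft model is unrolled for several aligning steps, and after the first step it conditions on its own (detached) predicted features rather than the target model's, mirroring test-time drafting. The class/function names, the tiny one-layer draft model, and the smooth-L1 regression loss are assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDraft(nn.Module):
    """Stand-in for an EAGLE-style one-layer draft model (hypothetical)."""
    def __init__(self, d):
        super().__init__()
        self.fc = nn.Linear(2 * d, d)

    def forward(self, feats, embeds):
        # Draft model predicts next-step features from context features + token embeddings.
        return self.fc(torch.cat([feats, embeds], dim=-1))

def train_batch(draft, target_feats, embeds, align_steps=3):
    """Unroll the draft model for `align_steps` steps.

    Step 1 conditions on the target model's features; later steps condition on
    the draft model's own detached predictions, aligning the training-time
    context with the test-time drafting context.
    """
    feats, losses = target_feats, []
    for _ in range(align_steps):
        pred = draft(feats, embeds)
        losses.append(F.smooth_l1_loss(pred, target_feats))
        feats = pred.detach()  # next step sees the draft's own features
    return torch.stack(losses).mean()
```

The `align_steps=3` default matches the number of aligning steps reported in the paper's experiment setup.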
Open Source Code | Yes | The code is available at https://github.com/HArmonizedSS/HASS.
Open Datasets | Yes | For multi-turn conversation, code generation, and mathematical reasoning tasks, we choose the MT-bench (Zheng et al., 2024), HumanEval (Chen et al., 2021), and GSM8K (Cobbe et al., 2021) datasets, respectively. We keep other settings, such as the fixed training dataset, i.e., the ShareGPT dataset with 68,000 dialogues... (https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered)
Dataset Splits | No | The paper evaluates on the MT-bench, HumanEval, and GSM8K datasets but does not provide specific train/test/validation splits. It mentions using varying proportions (1/8, 1/4, 1/2, 1/1) of the ShareGPT dataset for training, but this concerns the quantity of training data rather than defining dataset splits for evaluation.
Hardware Specification | Yes | All inference processes are conducted on an NVIDIA H800 GPU. ... we train draft models for LLaMA2-Chat 7/13B and LLaMA3-Instruct 8/70B on a single NVIDIA H800 GPU with batch size set to 2 and varied aligning steps.
Software Dependencies | No | Our code is built based on EAGLE-2's open-source repository. The paper does not specify versions for software dependencies such as programming languages or libraries.
Experiment Setup | Yes | The batch size is set to 1 in all experiments... For harmonized objective distillation, K is set to 10, and the distillation loss is added to EAGLE's original loss with a coefficient of w = 1.0. For harmonized context alignment, the draft model is aligned for 3 steps during training. For the dynamic tree structure, the total number of draft tokens is set to 60 for all experiments, with a draft tree depth of 6. Other settings, such as the fixed training dataset (the ShareGPT dataset with 68,000 dialogues) and the optimizer, are kept consistent with EAGLE-2. Tables 1 and 2 also list results for 'Temperature = 0' and 'Temperature = 1'.
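For convenience, the reported hyperparameters above can be collected into a single configuration. This is a hypothetical config dict (the key names are ours, not from the paper or its codebase); only the values come from the quoted setup.

```python
# Hypothetical configuration gathering the hyperparameters reported in the paper.
hass_config = {
    "inference_batch_size": 1,   # batch size under all experiments
    "top_k_distill": 10,         # K for harmonized objective distillation
    "distill_loss_weight": 1.0,  # w, coefficient on the distillation loss
    "align_steps": 3,            # harmonized context alignment steps
    "num_draft_tokens": 60,      # total draft tokens in the dynamic tree
    "draft_tree_depth": 6,       # depth of the draft tree
}
```

Training-data size (the ShareGPT dataset with 68,000 dialogues) and the optimizer follow EAGLE-2 and are therefore not restated here.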