Hansel: Output Length Controlling Framework for Large Language Models

Authors: Seoha Song, Junhyun Lee, Hyeonmok Ko

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate this by finetuning four different LLMs with Hansel and show that the mean absolute error of the output sequence decreases significantly in every model and dataset compared to the prompt-based length control finetuning. Moreover, the framework showed a substantially improved ability to extrapolate to target lengths unseen during finetuning, such as long dialog responses or extremely short summaries.
Researcher Affiliation | Industry | Samsung Research
Pseudocode | No | The paper describes the Hansel framework and dataset augmentation process in prose but does not include any clearly labeled pseudocode or algorithm blocks. It mentions training and inference examples in the Appendix, but the Appendix is not provided in the given text.
Open Source Code | No | The paper does not provide a specific link or explicit statement from the authors about releasing the source code for the Hansel framework or their methodology. It only mentions using "the Google Research implementation of ROUGE (https://github.com/google-research/google-research/tree/master/rouge)", which is a third-party tool.
Open Datasets | Yes | CNN/DM (Hermann et al. 2015) is a large-scale news dataset from the Cable News Network (CNN) and the Daily Mail (DM). XSum (Narayan, Cohen, and Lapata 2018) is a highly abstractive summarization dataset of news articles from the British Broadcasting Corporation (BBC). DailyDialog (Li et al. 2017) is a multi-turn dialogue dataset on various daily conversation topics, including relationships, ordinary life, and work. MultiWOZ (Zang et al. 2020) is a task-oriented dialogue dataset spanning 8 domains: restaurant, hotel, attraction, taxi, train, hospital, bus, and police.
Dataset Splits | No | The paper mentions using "test sets" but does not specify exact split percentages, absolute sample counts for each split, or reference predefined splits with citations for train/validation/test sets for reproducibility. It only describes how 20% of samples are chosen for a specific training augmentation strategy related to the hyperparameter δ.
Hardware Specification | Yes | Finetuning and inference are conducted using 8 Nvidia Tesla V100 GPUs (40GB).
Software Dependencies | No | The paper mentions using the "AdamW optimizer (Loshchilov and Hutter 2018)" but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementing their methodology.
Experiment Setup | Yes | We used batch size 512 and the AdamW optimizer (Loshchilov and Hutter 2018) with a 5 × 10^-5 learning rate and parameters β1 = 0.9, β2 = 0.95. We finetune the pretrained LLMs for 2 epochs. ... We set the maximum token number to 1722 and truncated longer examples.
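The Experiment Setup row can be collected into a single configuration fragment. This is a hypothetical sketch for readability, not released code; all key names are illustrative.

```python
# Hypothetical finetuning configuration mirroring the reported setup.
# Key names are illustrative; the paper does not publish a config file.
finetune_config = {
    "batch_size": 512,
    "optimizer": "AdamW",          # Loshchilov and Hutter 2018
    "learning_rate": 5e-5,
    "betas": (0.9, 0.95),          # β1, β2
    "epochs": 2,
    "max_tokens": 1722,            # longer examples are truncated
    "gpus": "8x Nvidia Tesla V100 (40GB)",
}

print(finetune_config["learning_rate"])  # 5e-05
```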
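The Research Type row reports results as the mean absolute error between requested and produced output lengths. A minimal stdlib sketch of that metric (illustrative token counts, not the authors' evaluation code):

```python
def length_mae(target_lengths, output_lengths):
    """Mean absolute error between target and actual output lengths (in tokens)."""
    assert len(target_lengths) == len(output_lengths)
    errors = [abs(t - o) for t, o in zip(target_lengths, output_lengths)]
    return sum(errors) / len(errors)

# Example: three generations asked for 50, 20, and 100 tokens.
mae = length_mae([50, 20, 100], [48, 25, 90])
print(mae)  # (2 + 5 + 10) / 3 = 5.666666666666667
```

A lower value means the model's outputs track the requested lengths more closely, which is the quantity the paper compares against prompt-based length control.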