Hansel: Output Length Controlling Framework for Large Language Models

Authors: Seoha Song, Junhyun Lee, Hyeonmok Ko

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate this by finetuning four different LLMs with Hansel and show that the mean absolute error of the output sequence decreases significantly in every model and dataset compared to the prompt-based length control finetuning. Moreover, the framework showed a substantially improved ability to extrapolate to target lengths unseen during finetuning, such as long dialog responses or extremely short summaries.
Researcher Affiliation | Industry | Samsung Research
Pseudocode | No | The paper describes the Hansel framework and dataset augmentation process in prose but does not include any clearly labeled pseudocode or algorithm blocks. It mentions training and inference examples in the Appendix, but the Appendix is not provided in the given text.
Open Source Code | No | The paper does not provide a specific link or explicit statement from the authors about releasing the source code for the Hansel framework or their methodology. It only mentions using "the Google Research implementation of ROUGE (https://github.com/google-research/google-research/tree/master/rouge)", which is a third-party tool.
Open Datasets | Yes | CNN/DM (Hermann et al. 2015) is a large-scale news dataset from the Cable News Network (CNN) and the Daily Mail (DM). XSum (Narayan, Cohen, and Lapata 2018) is a highly abstractive summarization dataset of news articles from the British Broadcasting Corporation (BBC). DailyDialog (Li et al. 2017) is a multi-turn dialogue dataset on various daily conversation topics, including relationships, ordinary life, and work. MultiWOZ (Zang et al. 2020) is a task-oriented dialogue dataset spanning 8 domains: restaurant, hotel, attraction, taxi, train, hospital, bus, and police.
Dataset Splits | No | The paper mentions using "test sets" but does not specify exact split percentages, absolute sample counts for each split, or reference predefined splits with citations for train/validation/test sets for reproducibility. It only describes how 20% of samples are chosen for a specific training augmentation strategy related to the hyperparameter δ.
Hardware Specification | Yes | Finetuning and inference are conducted using 8 Nvidia Tesla V100 GPUs (40GB).
Software Dependencies | No | The paper mentions using the "AdamW optimizer (Loshchilov and Hutter 2018)" but does not provide specific version numbers for any software libraries, frameworks, or programming languages used for implementing their methodology.
Experiment Setup | Yes | We used batch size 512 and the AdamW optimizer (Loshchilov and Hutter 2018) with a 5 × 10^-5 learning rate and parameters β1 = 0.9, β2 = 0.95. We finetune the pretrained LLMs for 2 epochs. ... We set the maximum token number to 1722 and truncated longer examples.
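The Experiment Setup row can be collected into a single configuration fragment. This is a hypothetical sketch for readability, not released code; all key names are illustrative.

```python
# Hypothetical finetuning configuration mirroring the reported setup.
# Key names are illustrative; the paper does not publish a config file.
finetune_config = {
    "batch_size": 512,
    "optimizer": "AdamW",          # Loshchilov and Hutter 2018
    "learning_rate": 5e-5,
    "betas": (0.9, 0.95),          # β1, β2
    "epochs": 2,
    "max_tokens": 1722,            # longer examples are truncated
    "gpus": "8x Nvidia Tesla V100 (40GB)",
}

print(finetune_config["learning_rate"])  # 5e-05
```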
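The Research Type row reports results as the mean absolute error between requested and produced output lengths. A minimal stdlib sketch of that metric (illustrative token counts, not the authors' evaluation code):

```python
def length_mae(target_lengths, output_lengths):
    """Mean absolute error between target and actual output lengths (in tokens)."""
    assert len(target_lengths) == len(output_lengths)
    errors = [abs(t - o) for t, o in zip(target_lengths, output_lengths)]
    return sum(errors) / len(errors)

# Example: three generations asked for 50, 20, and 100 tokens.
mae = length_mae([50, 20, 100], [48, 25, 90])
print(mae)  # (2 + 5 + 10) / 3 = 5.666666666666667
```

A lower value means the model's outputs track the requested lengths more closely, which is the quantity the paper compares against prompt-based length control.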