Continuous Language Model Interpolation for Dynamic and Controllable Text Generation

Authors: Sara Kangaslahti, David Alvarez-Melis

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that varying the interpolation weights yields predictable and consistent changes in the model outputs with respect to all of the controlled attributes simultaneously. We evaluate the ability of weight interpolation to control the outputs of LLMs on five commonly used style attributes defined in prior style-transfer literature (Jin et al., 2022).
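The weight-interpolation idea evaluated above can be sketched as a convex combination of two fine-tuned models' parameters. The following minimal illustration uses plain Python dicts as stand-in state dicts; the function and variable names are ours for illustration, not from the paper's released code:

```python
def interpolate_weights(theta_a, theta_b, alpha):
    """Linearly interpolate two parameter sets:
    theta = (1 - alpha) * theta_a + alpha * theta_b.
    At alpha = 0 we recover model A; at alpha = 1, model B."""
    assert theta_a.keys() == theta_b.keys(), "models must share an architecture"
    return {name: (1 - alpha) * theta_a[name] + alpha * theta_b[name]
            for name in theta_a}

# Toy example: two "models" with a single scalar parameter each,
# standing in for the informal- and formal-style fine-tuned weights.
informal = {"w": 0.0}
formal = {"w": 1.0}
print(interpolate_weights(informal, formal, 0.25)["w"])  # 0.25
```

In practice the same per-parameter combination would be applied to every tensor in the two models' state dicts, and sweeping `alpha` through [0, 1] traces out the continuous attribute control the paper reports.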
Researcher Affiliation | Academia | Sara Kangaslahti, School of Engineering and Applied Sciences, Harvard University; David Alvarez-Melis, Kempner Institute, Harvard University
Pseudocode | No | The paper includes mathematical equations (e.g., Equations 1, 2, and 3) and descriptive text for its methods, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present procedures in a code-like structured format.
Open Source Code | Yes | Code: https://github.com/skangasl/continuous-lm-interpolation
Open Datasets | Yes | For simplicity, we use the TinyStories dataset (Eldan & Li, 2023) to fine-tune a simple model and novel chapters from the BookSum dataset (Kryscinski et al., 2021) to fine-tune a complex model. We use the documents classified as formal and informal in Grammarly's Yahoo Answers Formality Corpus (GYAFC) dataset (Rao & Tetreault, 2018) to fine-tune formal and informal models. For the politeness attribute, we use the documents in the highest and lowest politeness classes in the work by Madaan et al. (2020) to fine-tune polite and impolite models, respectively. We fine-tune positive and negative sentiment models using the Stanford Sentiment Treebank (SST-2) dataset (Socher et al., 2013). For humor, we use the FlickrStyle dataset (Gan et al., 2017) to fine-tune humorous and non-humorous models. To evaluate the interpolated models, we use a subset of 1k randomly sampled prompts from the WritingPrompts dataset (Fan et al., 2018) and generate 3 continuations for each prompt. We also compute perplexity on the test split of the WikiText dataset (Merity et al., 2016).
Dataset Splits | Yes | Table 3: Fine-tuning splits. We report the number of examples from each attribute dataset used to fine-tune Llama2-7b generation and RoBERTa attribute-scoring models. Each split is sampled from the combined train, test, and validation sets.

Domain | Llama2 split size (Class 0 / Class 1) | RoBERTa split size
Sentiment (Socher et al., 2013) | 25k / 30k | 10k
Politeness (Madaan et al., 2020) | 78k / 100k | 20k
Formality (Rao & Tetreault, 2018) | 104k / 104k | 10k
Simplicity (Kryscinski et al., 2021; Eldan & Li, 2023) | 9k / 100k | 10k
Humor (Gan et al., 2017) | 100k / 100k | 20k
Hardware Specification | Yes | All experiments were run on single NVIDIA A100 80GB SXM GPU nodes.
Software Dependencies | No | The paper mentions using specific models like Llama2-7b and RoBERTa, and techniques like LoRA. However, it does not provide specific version numbers for ancillary software such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch), or other libraries used in the implementation of the experiments.
Experiment Setup | Yes | Table 2: Parameters for LoRA fine-tuning. We use 20 epochs for fine-tuning the sentiment attribute models and 1 epoch for the remaining fine-tuned models.

LoRA hyperparameter | Value
Batch size | 64
Learning rate | 5e-5
LoRA r | 32
LoRA α | 16
LoRA dropout | 0.1
Max sequence length | 128
Quantization | 4-bit
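Table 2's hyperparameters map roughly onto a Hugging Face PEFT/Transformers configuration. The sketch below assumes that stack, which the paper does not confirm (no library versions are stated); the output directory name is ours, and the max sequence length of 128 would be applied at tokenization time:

```python
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# LoRA adapter settings from Table 2.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# 4-bit quantization of the Llama2-7b base model, per Table 2.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

# Optimizer/schedule settings from Table 2; num_train_epochs is 20 for
# the sentiment attribute models and 1 for all other attributes.
training_args = TrainingArguments(
    output_dir="lora-attribute-model",  # illustrative path
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    num_train_epochs=1,
)
```

Tokenized inputs would then be truncated/padded to `max_length=128` to match the reported max sequence length.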