L3Ms — Lagrange Large Language Models
Authors: Guneet Singh Dhillon, Xingjian Shi, Yee Whye Teh, Alex Smola
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate the versatility and efficacy of L3Ms in achieving tailored alignments for various applications. 6 EXPERIMENTAL RESULTS |
| Researcher Affiliation | Collaboration | Guneet S. Dhillon¹, Xingjian Shi², Yee Whye Teh¹, Alex Smola²; ¹University of Oxford, ²Boson AI |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The derivations in Section 5.1 and Appendix B are mathematical, not algorithmic. |
| Open Source Code | Yes | Our code, based on the Transformers library (Wolf et al., 2020), is available at: https://github.com/Guneet-Dhillon/l3m. |
| Open Datasets | Yes | We use Ultra Chat (Ding et al., 2023), a large-scale dataset of instructional conversations, as our task data to induce instruction-following capabilities. We use the Helpful and Harmless (Bai et al., 2022) preference data to learn two reward models, respectively. |
| Dataset Splits | Yes | Consequently, we obtain 340k training samples, 1.7k validation samples, and 1.7k test samples, split randomly since the dataset does not contain train-val-test splits. |
| Hardware Specification | Yes | We run all experiments on NVIDIA H100s. |
| Software Dependencies | No | The paper mentions the 'Transformers library (Wolf et al., 2020)' but does not provide a specific version number for it, nor for any other software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | We fine-tune LLMs for 1 epoch on the task data, with a mini-batch size of 64. We use Adam with a learning rate of 10⁻⁶ and a cosine learning rate scheduler (with 5% of the epoch used for warmup). We set weight decay to 0.1 and the gradient clipping maximum norm to 1. We utilize 16-bit (mixed) precision training and gradient checkpointing. We exponentially decay the log-barrier parameter µ during fine-tuning from 1 to 10⁻⁶ and use a smoothing factor of 0.1 for the exponential moving average. Lastly, we use top-p sampling (p set to 0.9) for response generation. |
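The reported split sizes (340k / 1.7k / 1.7k) imply a random partition of roughly 343.4k samples. A minimal sketch of such a split, assuming a fixed seed and index-based partitioning (the function name, seed, and exact procedure are illustrative, not taken from the paper's code):

```python
import random

def random_split(n, n_val=1700, n_test=1700, seed=0):
    """Randomly partition n sample indices into train/val/test sets.

    Used when a dataset (like UltraChat) ships without official splits.
    The seed and split mechanics here are assumptions for illustration.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    held_out = n_val + n_test
    train = idx[:-held_out]
    val = idx[-held_out:-n_test]
    test = idx[-n_test:]
    return train, val, test

train, val, test = random_split(343_400)
# Matches the reported counts: 340k train, 1.7k val, 1.7k test.
```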
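The quoted setup decays the log-barrier parameter µ exponentially from 1 to 10⁻⁶ over fine-tuning and smooths some running quantity with an EMA factor of 0.1. A sketch of both schedules, assuming a geometric interpolation between the stated endpoints and the common `new = s·x + (1−s)·old` EMA form (which quantity is smoothed is not specified here):

```python
def mu_schedule(step, total_steps, mu_start=1.0, mu_end=1e-6):
    """Log-barrier parameter decayed geometrically from mu_start to mu_end.

    Assumes a geometric (exponential-in-step) interpolation; the paper only
    states the endpoints, so the exact curve is an assumption.
    """
    frac = step / max(total_steps - 1, 1)
    return mu_start * (mu_end / mu_start) ** frac

def ema_update(ema, value, smoothing=0.1):
    """One EMA step with smoothing factor 0.1 (assumed convention)."""
    return smoothing * value + (1 - smoothing) * ema

# Endpoints match the reported decay from 1 to 1e-6.
mus = [mu_schedule(t, 1000) for t in range(1000)]
```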