L3Ms — Lagrange Large Language Models
Authors: Guneet Singh Dhillon, Xingjian Shi, Yee Whye Teh, Alex Smola
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate the versatility and efficacy of L3Ms in achieving tailored alignments for various applications. 6 EXPERIMENTAL RESULTS |
| Researcher Affiliation | Collaboration | Guneet S. Dhillon¹, Xingjian Shi², Yee Whye Teh¹, Alex Smola²; ¹University of Oxford, ²Boson AI |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. The derivations in Section 5.1 and Appendix B are mathematical, not algorithmic. |
| Open Source Code | Yes | Our code, based on the Transformers library (Wolf et al., 2020), is available at: https://github.com/Guneet-Dhillon/l3m. |
| Open Datasets | Yes | We use Ultra Chat (Ding et al., 2023), a large-scale dataset of instructional conversations, as our task data to induce instruction-following capabilities. We use the Helpful and Harmless (Bai et al., 2022) preference data to learn two reward models, respectively. |
| Dataset Splits | Yes | Consequently, we obtain 340k training samples, 1.7k validation samples, and 1.7k test samples, split randomly since the dataset does not contain train-val-test splits. |
| Hardware Specification | Yes | We run all experiments on NVIDIA H100s. |
| Software Dependencies | No | The paper mentions the 'Transformers library (Wolf et al., 2020)' but does not provide a specific version number for it, nor for any other software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | We fine-tune LLMs for 1 epoch on the task data, with a mini-batch size of 64. We use Adam with a learning rate of 10⁻⁶ and a cosine learning rate scheduler (with 5% of the epoch used for warmup). We set weight decay to 0.1 and the gradient clipping maximum norm to 1. We utilize 16-bit (mixed) precision training and gradient checkpointing. We exponentially decay the log-barrier parameter µ during fine-tuning from 1 to 10⁻⁶ and use a smoothing factor of 0.1 for the exponential moving average. Lastly, we use top-p sampling (p set to 0.9) for response generation. |
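The reported split sizes (340k / 1.7k / 1.7k) imply a random partition of roughly 343.4k samples. A minimal sketch of such a split, assuming a fixed seed and index-based partitioning (the function name, seed, and exact procedure are illustrative, not taken from the paper's code):

```python
import random

def random_split(n, n_val=1700, n_test=1700, seed=0):
    """Randomly partition n sample indices into train/val/test sets.

    Used when a dataset (like UltraChat) ships without official splits.
    The seed and split mechanics here are assumptions for illustration.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    held_out = n_val + n_test
    train = idx[:-held_out]
    val = idx[-held_out:-n_test]
    test = idx[-n_test:]
    return train, val, test

train, val, test = random_split(343_400)
# Matches the reported counts: 340k train, 1.7k val, 1.7k test.
```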
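The quoted setup decays the log-barrier parameter µ exponentially from 1 to 10⁻⁶ over fine-tuning and smooths some running quantity with an EMA factor of 0.1. A sketch of both schedules, assuming a geometric interpolation between the stated endpoints and the common `new = s·x + (1−s)·old` EMA form (which quantity is smoothed is not specified here):

```python
def mu_schedule(step, total_steps, mu_start=1.0, mu_end=1e-6):
    """Log-barrier parameter decayed geometrically from mu_start to mu_end.

    Assumes a geometric (exponential-in-step) interpolation; the paper only
    states the endpoints, so the exact curve is an assumption.
    """
    frac = step / max(total_steps - 1, 1)
    return mu_start * (mu_end / mu_start) ** frac

def ema_update(ema, value, smoothing=0.1):
    """One EMA step with smoothing factor 0.1 (assumed convention)."""
    return smoothing * value + (1 - smoothing) * ema

# Endpoints match the reported decay from 1 to 1e-6.
mus = [mu_schedule(t, 1000) for t in range(1000)]
```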