Logits are All We Need to Adapt Closed Models
Authors: Gaurush Hiranandani, Haolun Wu, Subhojyoti Mukherjee, Sanmi Koyejo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models. We conduct extensive experiments across four language generation datasets and three black-box LLMs. |
| Researcher Affiliation | Collaboration | Gaurush Hiranandani 1 Haolun Wu* 2 3 Subhojyoti Mukherjee 4 Sanmi Koyejo 2 1Typeface 2Stanford University 3Mila Quebec AI Institute 4Adobe Research. Correspondence to: Gaurush Hiranandani <EMAIL>, Haolun Wu <EMAIL>. |
| Pseudocode | Yes | The combined approach, integrating probabilities from the black-box and reweighting models, is referred to as the Plugin model. We now detail the training and inference phases, summarized in Algorithm 1 (Appendix A) and illustrated in Figure 1. Algorithm 1 Training and Inference for the Plugin Model |
| Open Source Code | Yes | We provide our code at this https URL. |
| Open Datasets | Yes | We evaluate Plugin on four text generation benchmarks: (a) E2E NLG (Dušek et al., 2020), (b) Web NLG (Gardent et al., 2017), (c) Common Gen (Lin et al., 2020), and (d) the Adidas product description dataset (adi, 2023). Adidas US Retail Products Dataset. Kaggle, 2023. URL https://www.kaggle.com/datasets/whenamancodes/adidas-us-retail-products-dataset. |
| Dataset Splits | Yes | For the first three datasets, we use the train-validation-test splits from the Transformers library (Wolf, 2020). The Adidas dataset is split into validation and test sets. Data statistics are provided in Table 6, Appendix C.1. Table 6 (Processed Dataset Statistics, train/validation/test): E2E NLG: 33,525 / 4,299 / 4,693; Web NLG (filtered by categories): 2,732 / 844 / 720; Common Gen (filtered for "man"): 1,476 / 2,026 / 1,992; Adidas (validation/test only): 745 / 100. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as specific GPU or CPU models. It mentions 'computational overhead' and 'FLOPs' in Appendix C.4, but not the actual hardware. |
| Software Dependencies | No | The paper mentions using the 'Transformers library (Wolf, 2020)', 'AdamW', and 'NLTK's word tokenizer' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Learning rate and weight decay are cross-validated over {1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3} and {0.01, 0.1, 1, 10}, respectively. Models are trained using AdamW with warmup followed by linear decay, and early stopping is applied if the hyper-validation loss does not decrease for five consecutive epochs. |
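The experiment-setup row above pins down a concrete search procedure: a learning-rate × weight-decay grid, AdamW with warmup then linear decay, and early stopping with a patience of five epochs. A minimal sketch of that grid search and early-stopping logic, framework-free so the control flow is clear; `train_one_epoch` and `evaluate` are hypothetical callbacks standing in for the paper's actual AdamW training loop:

```python
import itertools

# Hyperparameter grids stated in the paper's experiment setup.
LEARNING_RATES = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3]
WEIGHT_DECAYS = [0.01, 0.1, 1, 10]
PATIENCE = 5  # epochs without validation improvement before stopping


def train_with_early_stopping(train_one_epoch, max_epochs=100, patience=PATIENCE):
    """Run epochs until validation loss fails to improve `patience` times in a row.

    `train_one_epoch(epoch)` is a hypothetical callback that performs one
    training epoch (in the paper: AdamW with warmup followed by linear
    decay) and returns the hyper-validation loss.
    """
    best_loss = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch in range(max_epochs):
        val_loss = train_one_epoch(epoch)
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop
    return best_loss


def grid_search(evaluate):
    """Cross-validate over the (lr, weight_decay) grid.

    `evaluate(lr, wd)` is a hypothetical function mapping one configuration
    to its best validation loss; the configuration minimizing it wins.
    """
    return min(
        itertools.product(LEARNING_RATES, WEIGHT_DECAYS),
        key=lambda cfg: evaluate(*cfg),
    )
```

With a patience of 5, a run whose validation loss plateaus for five epochs is cut off even if it would later improve, which is exactly the trade-off the paper's early-stopping rule accepts.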