Logits are All We Need to Adapt Closed Models
Authors: Gaurush Hiranandani, Haolun Wu, Subhojyoti Mukherjee, Sanmi Koyejo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models. We conduct extensive experiments across four language generation datasets and three black-box LLMs. |
| Researcher Affiliation | Collaboration | Gaurush Hiranandani 1 Haolun Wu* 2 3 Subhojyoti Mukherjee 4 Sanmi Koyejo 2 1Typeface 2Stanford University 3Mila Quebec AI Institute 4Adobe Research. Correspondence to: Gaurush Hiranandani <EMAIL>, Haolun Wu <EMAIL>. |
| Pseudocode | Yes | The combined approach, integrating probabilities from the black-box and reweighting models, is referred to as the Plugin model. We now detail the training and inference phases, summarized in Algorithm 1 (Appendix A) and illustrated in Figure 1. Algorithm 1 Training and Inference for the Plugin Model |
| Open Source Code | Yes | We provide our code at this https URL. |
| Open Datasets | Yes | We evaluate Plugin on four text generation benchmarks: (a) E2E NLG (Dušek et al., 2020), (b) Web NLG (Gardent et al., 2017), (c) Common Gen (Lin et al., 2020), and (d) the Adidas product description dataset (adi, 2023). Adidas US Retail Products Dataset. Kaggle, 2023. URL https://www.kaggle.com/datasets/whenamancodes/adidas-us-retail-products-dataset. |
| Dataset Splits | Yes | For the first three datasets, we use the train-validation-test splits from the Transformers library (Wolf, 2020). The Adidas dataset is split into validation and test sets. Data statistics are provided in Table 6, Appendix C.1. Table 6 (Processed Dataset Statistics, train/validation/test): E2E NLG: 33,525 / 4,299 / 4,693; Web NLG (filtered by categories): 2,732 / 844 / 720; Common Gen (filtered for "man"): 1,476 / 2,026 / 1,992; Adidas (validation/test only): 745 / 100. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as specific GPU or CPU models. It mentions 'computational overhead' and 'FLOPs' in Appendix C.4, but not the actual hardware. |
| Software Dependencies | No | The paper mentions using the 'Transformers library (Wolf, 2020)', 'AdamW', and 'NLTK's word tokenizer' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Learning rate and weight decay are cross-validated over {1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3} and {0.01, 0.1, 1, 10}, respectively. Models are trained using AdamW with warmup followed by linear decay, and early stopping is applied if the hyper-validation loss does not decrease for five consecutive epochs. |
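The experiment-setup row above pins down a concrete search procedure: a learning-rate × weight-decay grid, AdamW with warmup then linear decay, and early stopping with a patience of five epochs. A minimal sketch of that grid search and early-stopping logic, framework-free so the control flow is clear; `train_one_epoch` and `evaluate` are hypothetical callbacks standing in for the paper's actual AdamW training loop:

```python
import itertools

# Hyperparameter grids stated in the paper's experiment setup.
LEARNING_RATES = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3]
WEIGHT_DECAYS = [0.01, 0.1, 1, 10]
PATIENCE = 5  # epochs without validation improvement before stopping


def train_with_early_stopping(train_one_epoch, max_epochs=100, patience=PATIENCE):
    """Run epochs until validation loss fails to improve `patience` times in a row.

    `train_one_epoch(epoch)` is a hypothetical callback that performs one
    training epoch (in the paper: AdamW with warmup followed by linear
    decay) and returns the hyper-validation loss.
    """
    best_loss = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch in range(max_epochs):
        val_loss = train_one_epoch(epoch)
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop
    return best_loss


def grid_search(evaluate):
    """Cross-validate over the (lr, weight_decay) grid.

    `evaluate(lr, wd)` is a hypothetical function mapping one configuration
    to its best validation loss; the configuration minimizing it wins.
    """
    return min(
        itertools.product(LEARNING_RATES, WEIGHT_DECAYS),
        key=lambda cfg: evaluate(*cfg),
    )
```

With a patience of 5, a run whose validation loss plateaus for five epochs is cut off even if it would later improve, which is exactly the trade-off the paper's early-stopping rule accepts.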