Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models

Authors: Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results, covering both vision and language models, demonstrate that the PRT-trained model can achieve accuracy comparable to existing inference-time tuning work, at lower inference cost. In Section 4, the paper presents "Experiments" with detailed "Results" (Figures 2, 3, 4) and "Memory and Speed Analysis" (Tables 1, 2, 3), indicating empirical evaluation.
Researcher Affiliation | Industry | (1) NTT Computer and Data Science Laboratories, NTT Corporation; (2) NTT Human Informatics Laboratories, NTT Corporation. Correspondence to: Daiki Chijiwa <EMAIL>, Taku Hasegawa <EMAIL>. All authors are affiliated with NTT Corporation, an industry entity, and the email domains are @ntt.com.
Pseudocode | Yes | Algorithm 1: Pseudocode for Training of PRT; Algorithm 2: Pseudocode for Inference of PRT.
Open Source Code | No | The text contains no explicit statement from the authors about releasing their own code for the methodology described in this paper, nor a direct link to a code repository.
Open Datasets | Yes | We employed CLIP models (...) pretrained on various datasets including (...) LAION-400M (Schuhmann et al., 2021), LAION-2B (Schuhmann et al., 2022), and DataComp-1B (Gadre et al., 2024). (...) For each fine-grained dataset, such as Cars (Krause et al., 2013) and CUB (Wah et al., 2011). (...) Aircraft (Maji et al., 2013), Caltech101 (Li et al., 2022), Cars (Krause et al., 2013), CIFAR-100 (Krizhevsky et al., 2009), Country211 (Radford et al., 2021), CUB (Wah et al., 2011), Flowers (Nilsback & Zisserman, 2008), RESISC45 (Cheng et al., 2017). (...) Tulu v2 dataset (Ivison et al., 2023). (...) GSM8K (Cobbe et al., 2021) and IFEval (Zhou et al., 2023).
Dataset Splits | Yes | For each fine-grained dataset, such as Cars (Krause et al., 2013) and CUB (Wah et al., 2011), we first constructed and fixed the classification layer of each pretrained model for zero-shot classification, and then fine-tuned (or reward-tuned) its feature extractor on the train set. (3) Evaluation: We evaluated models on the test set of each dataset where the training set was used for tuning the models.
Hardware Specification | Yes | "All models are trained on a single A100 GPU." "We conducted all training on 8 NVIDIA A100 GPUs." (The two statements appear to describe different experiment groups, consistent with the paper covering both vision and language models.)
Software Dependencies | No | The paper mentions optimizers (Adam) but does not provide specific software library names with version numbers (e.g., PyTorch, TensorFlow) or specific Python versions.
Experiment Setup | Yes | In training of either standard fine-tuning or PRT, we used the same hyperparameters following existing work (Ilharco et al., 2023) as follows: learning rate = 1e-5, batch size = 128, number of iterations = 2000, optimizer = Adam, cosine annealing with 500 warmup iterations. A second set of training conditions is also given: learning rate = 2e-5, batch size = 128, number of epochs = 2, optimizer = Adam, warmup ratio = 0.03, and learning rate scheduler = linear.
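The table notes that the paper provides pseudocode for both training and inference of PRT (Algorithms 1 and 2), though the report itself does not reproduce them. Reward-tuning methods of this family typically combine a pretrained model's output distribution with a separately trained reward at inference time, which amounts to adding a (scaled) reward to the logits before the softmax. The sketch below illustrates that general combination rule only; `prt_inference`, `beta`, and the toy numbers are my own illustrative assumptions, not the paper's exact algorithm.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def prt_inference(pretrained_logits, reward_scores, beta=1.0):
    """Hypothetical sketch of reward-guided inference: reweight the
    pretrained model's distribution by exp(beta * reward), which is
    equivalent to adding beta * reward to the logits."""
    combined = [l + beta * r for l, r in zip(pretrained_logits, reward_scores)]
    return softmax(combined)

# Toy example (made-up numbers): the reward shifts probability
# mass toward the third class.
base = [2.0, 1.0, 0.5]
reward = [0.0, 0.0, 3.0]
probs = prt_inference(base, reward)
```

With a zero reward the rule reduces to the pretrained model's own softmax, which is the property that lets the same reward be reused across different pretrained backbones.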
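The first hyperparameter set quotes "cosine annealing with 500 warmup iterations" over 2000 iterations at a base learning rate of 1e-5. A minimal sketch of such a schedule is below, written in plain Python for clarity (in practice this would be a PyTorch `LambdaLR` or similar); the function name and the linear-warmup-then-cosine-decay-to-zero convention are my assumptions, as the paper's exact scheduler details are not given in the report.

```python
import math

def lr_at_step(step, base_lr=1e-5, warmup=500, total=2000):
    """Linear warmup for `warmup` steps, then cosine decay from
    base_lr down to 0 at `total` steps (a common convention; the
    paper's exact schedule may differ)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, the learning rate rises linearly from 0 to 1e-5 over the first 500 steps, peaks exactly at step 500, and decays along a half-cosine to 0 at step 2000.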