WeGeFT: Weight-Generative Fine-Tuning for Multi-Faceted Efficient Adaptation of Large Models

Authors: Chinmay Savadikar, Xi Song, Tianfu Wu

ICML 2025

Reproducibility Checklist (each item lists the variable, the assessed result, and the supporting LLM response):
Research Type: Experimental. "Extensive experiments on commonsense reasoning, arithmetic reasoning, instruction following, code generation, and visual recognition verify the effectiveness of our proposed WeGeFT."
Researcher Affiliation: Academia. "1Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; 2Independent Researcher. Correspondence to: Chinmay Savadikar <EMAIL>, Tianfu Wu <EMAIL>."
Pseudocode: No. The paper describes the methodology using mathematical formulations (e.g., Eqns. 1, 3, and 5–13) and textual descriptions, but does not include a dedicated pseudocode block or a clearly labeled algorithm.
Open Source Code: Yes. Code: https://savadikarc.github.io/wegeft
Open Datasets: Yes. "We conduct extensive experiments across Natural Language Generation and Visual Recognition... on the Math10k benchmark (Hu et al., 2023)... MetaMathQA (Yu et al., 2024)... GSM8k test set (Cobbe et al., 2021)... Commonsense170k (BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-e and ARC-c (Clark et al., 2018), and OBQA (Mihaylov et al., 2018) datasets)... WizardLM dataset (Xu et al., 2024)... MT-Bench dataset (Zheng et al., 2023)... Code-Feedback dataset (Zheng et al., 2024)... HumanEval (Chen et al., 2021)... VTAB-1k benchmark (Zhai et al., 2019)... Caltech-UCSD Birds (Wah et al., 2011), NABirds (Horn et al., 2015), Oxford Flowers (Nilsback & Zisserman, 2008), Stanford Cars (Gebru et al., 2017), and Stanford Dogs (Khosla et al., 2011)... ImageNet-21k dataset (Deng et al., 2009)."
Dataset Splits: Yes. "On the Math10k, we follow (Wu et al., 2024), and tune the hyperparameters by fine-tuning the LLaMA-1 (7B) model on the GSM8k dataset (Cobbe et al., 2021) using a separate validation set constructed from the training set... We use the same train, validation, and test splits as (Shi et al., 2023), except for the Stanford Cars dataset... we create our own training and validation split (with the same number of images as (Shi et al., 2023)) and use the official testing split."
Hardware Specification: Yes. "All our experiments are run on a single NVIDIA A100 GPU."
Software Dependencies: No. The paper mentions a Hugging Face PEFT-based implementation, the timm package, and the AdamW optimizer in the context of the experiments, but does not provide specific version numbers for these software components.
Experiment Setup: Yes. The paper includes multiple tables detailing hyperparameters and experimental settings, such as Table 12 ("Hyperparameters used for the Math10k experiments"), Table 13 ("Hyperparameters used for fine-tuning on MetaMathQA and evaluating on GSM8k"), Table 14 ("Hyperparameters used for the commonsense reasoning experiments"), and Table 17 ("Hyperparameter search space used for FGVC experiments"). These tables specify values for parameters such as max sequence length, optimizer, learning rate, batch size, epochs, rank, scaling factor, warmup ratio, dropout, and fine-tuned layers.
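The Dataset Splits item above describes carving a held-out validation set out of the training data when no official validation split exists (as for the GSM8k hyperparameter tuning and the Stanford Cars split). A minimal stdlib sketch of that procedure, assuming a deterministic seeded shuffle; the function name, the 10% holdout fraction, and the seed are illustrative assumptions, not the paper's exact values:

```python
import random

def make_train_val_split(examples, val_fraction=0.1, seed=0):
    """Hold out a validation set from a training set (illustrative sketch)."""
    # Shuffle indices deterministically so the split is reproducible.
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    # The first val_fraction of the shuffled order becomes validation;
    # the remainder stays as the new (smaller) training set.
    n_val = int(len(examples) * val_fraction)
    val = [examples[i] for i in indices[:n_val]]
    train = [examples[i] for i in indices[n_val:]]
    return train, val

# Example: split a toy "training set" of 100 items 90/10.
train, val = make_train_val_split(list(range(100)), val_fraction=0.1, seed=0)
```

Reusing the same seed reproduces the split exactly, which is what makes such a construction reportable in a reproducibility checklist.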