Quantifying Prediction Consistency Under Fine-tuning Multiplicity in Tabular LLMs

Authors: Faisal Hamman, Pasan Dissanayake, Saumitra Mishra, Freddy Lecue, Sanghamitra Dutta

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures. Our experiments utilize the Diabetes (Kahn), German Credit (Hofmann, 1994), Bank (Moro et al., 2014), Heart, Car, and Adult datasets (Becker & Kohavi, 1996). We compare the computational requirements of our stability measure against retraining, dropout-based, prediction-probability, and Adversarial Weight Perturbation (AWP) (Hsu & Calmon, 2022) baselines in terms of both training and evaluation runtimes. We conduct the following ablation studies: an ablation on the sample size k, observing improved performance with increasing k, and an exploration of the effect of varying the neighborhood radius σ.
Researcher Affiliation Collaboration (1) Department of Electrical and Computer Engineering, University of Maryland, College Park; (2) JPMorgan Chase AI Research. Correspondence to: Faisal Hamman <EMAIL>.
Pseudocode No The paper describes methods and definitions but does not contain a clearly labeled pseudocode or algorithm block for the proposed methodology. Lemma 1 and Lemma 2 are presented in the proof section, but they are not algorithms.
Open Source Code No The paper does not contain an explicit statement about releasing code for the methodology described, nor does it provide a direct link to a source-code repository.
Open Datasets Yes Our experiments utilize the Diabetes (Kahn), German Credit (Hofmann, 1994), Bank (Moro et al., 2014), Heart, Car, and Adult datasets (Becker & Kohavi, 1996), serialized using the Text Template, i.e., each tabular entry is converted into a natural-language sentence: 'The <column name> is <value>.'
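The Text Template serialization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `serialize_row` and the dict-based row representation are assumptions for the example.

```python
def serialize_row(row: dict) -> str:
    """Serialize one tabular row using the Text Template:
    each column becomes the sentence 'The <column name> is <value>.'"""
    return " ".join(f"The {col} is {val}." for col, val in row.items())


# Example: a row from a hypothetical tabular dataset.
row = {"age": 39, "occupation": "teacher"}
print(serialize_row(row))  # → The age is 39. The occupation is teacher.
```

The resulting natural-language string is what the fine-tuned LLM receives as input in place of the raw tabular record.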
Dataset Splits No The paper states: 'The number of shots was set to 64, 128, and 512 for each dataset.' This indicates the number of training examples used in a few-shot setting but does not specify the overall training/validation/test splits (e.g., percentages or exact counts for each split) for the datasets, or how the full datasets are partitioned for evaluation.
Hardware Specification Yes All experiments were performed on 2 NVIDIA RTX A4500 and 4 NVIDIA RTX 6000 GPUs.
Software Dependencies No The paper mentions using 'BIGSCIENCE T0 (Sanh et al., 2021) and Google FLAN-T5 (Chung et al., 2024) encoder-decoder models' and 'the T-Few recipe (Liu et al., 2022), and LoRA (Hu et al., 2021)'. While these are software components and methods, specific version numbers for the underlying libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) are not provided.
Experiment Setup Yes The training process involved setting the batch size to 2 for smaller training sizes and 8 for larger sizes. The learning rate was set to 0.003. For each dataset, we determined the number of training steps adaptively based on the number of shots, ensuring sufficient iterations for model convergence. Specifically, the training steps were calculated as 20 × (number of shots / batch size). ...For fine-tuning with LoRA we use a rank of 4. ...We used error tolerance δ = 0.02, corresponding to a 2% margin of accuracy deviation. ...For the dropout rate in the baseline, we use p = 0.1 following the recommendation in Hsu et al. (2024).
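The reported setup can be summarized as a small configuration helper. This is a sketch under assumptions: the paper gives batch size 2 "for smaller training sizes" and 8 "for larger sizes" without stating the cutoff, so the threshold of 128 shots below is a guess, and the function name `training_config` is hypothetical.

```python
def training_config(num_shots: int, small_threshold: int = 128) -> dict:
    """Reconstruct the reported fine-tuning hyperparameters:
    - batch size 2 for smaller training sizes, 8 for larger ones
      (the cutoff `small_threshold` is an assumption, not from the paper);
    - learning rate 0.003;
    - training steps = 20 * (number of shots / batch size).
    """
    batch_size = 2 if num_shots <= small_threshold else 8
    steps = 20 * (num_shots // batch_size)
    return {"batch_size": batch_size, "learning_rate": 0.003, "steps": steps}


print(training_config(64))   # → {'batch_size': 2, 'learning_rate': 0.003, 'steps': 640}
print(training_config(512))  # → {'batch_size': 8, 'learning_rate': 0.003, 'steps': 1280}
```

Scaling steps with the shot count, as the formula does, keeps the number of epochs roughly constant across the 64-, 128-, and 512-shot settings.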