Quantifying Prediction Consistency Under Fine-tuning Multiplicity in Tabular LLMs

Authors: Faisal Hamman, Pasan Dissanayake, Saumitra Mishra, Freddy Lecue, Sanghamitra Dutta

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform experiments on multiple real-world datasets to show that our local stability measure preemptively captures consistency under actual multiplicity across several fine-tuned models, outperforming competing measures. Our experiments utilize the Diabetes (Kahn), German Credit (Hofmann, 1994), Bank (Moro et al., 2014), Heart, Car, and Adult datasets (Becker & Kohavi, 1996). We compare the computational requirements of our stability measure against retraining, dropout-based, prediction-probability, and Adversarial Weight Perturbation (AWP) (Hsu & Calmon, 2022) baselines in terms of both training and evaluation runtimes. We conduct the following ablation studies: an ablation on the sample size k, observing improved performance with increasing k, and an exploration of the effect of varying the neighborhood radius σ.
Researcher Affiliation Collaboration (1) Department of Electrical and Computer Engineering, University of Maryland, College Park; (2) JPMorgan Chase AI Research. Correspondence to: Faisal Hamman <EMAIL>.
Pseudocode No The paper describes methods and definitions but does not contain a clearly labeled pseudocode or algorithm block for the proposed methodology. Lemma 1 and Lemma 2 are presented in the proof section, but they are not algorithms.
Open Source Code No The paper does not contain an explicit statement about releasing code for the methodology described, nor does it provide a direct link to a source-code repository.
Open Datasets Yes Our experiments utilize the Diabetes (Kahn), German Credit (Hofmann, 1994), Bank (Moro et al., 2014), Heart, Car, and Adult datasets (Becker & Kohavi, 1996), serialized using the Text Template, i.e., each tabular entry is converted into a natural-language sentence: 'The <column name> is <value>.'
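The Text Template serialization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `serialize_row` and the dict-based row representation are assumptions for the example.

```python
def serialize_row(row: dict) -> str:
    """Serialize one tabular row using the Text Template:
    each column becomes the sentence 'The <column name> is <value>.'"""
    return " ".join(f"The {col} is {val}." for col, val in row.items())


# Example: a row from a hypothetical tabular dataset.
row = {"age": 39, "occupation": "teacher"}
print(serialize_row(row))  # → The age is 39. The occupation is teacher.
```

The resulting natural-language string is what the fine-tuned LLM receives as input in place of the raw tabular record.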
Dataset Splits No The paper states: 'The number of shots was set to 64, 128, and 512 for each dataset.' This indicates the number of training examples used in a few-shot setting but does not specify the overall training/validation/test splits (e.g., percentages or exact counts for each split) for the datasets, or how the full datasets are partitioned for evaluation.
Hardware Specification Yes All experiments were performed on 2 NVIDIA RTX A4500 and 4 NVIDIA RTX 6000 GPUs.
Software Dependencies No The paper mentions using 'BIGSCIENCE T0 (Sanh et al., 2021) and Google FLAN-T5 (Chung et al., 2024) encoder-decoder models' and 'the T-Few recipe (Liu et al., 2022), and LoRA (Hu et al., 2021)'. While these are software components and methods, specific version numbers for the underlying libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) are not provided.
Experiment Setup Yes The training process involved setting the batch size to 2 for smaller training sizes and 8 for larger sizes. The learning rate was set to 0.003. For each dataset, we determined the number of training steps adaptively based on the number of shots, ensuring sufficient iterations for model convergence. Specifically, the training steps were calculated as 20 × (number of shots / batch size). ...For fine-tuning with LoRA we use a rank of 4. ...We used error tolerance δ = 0.02, corresponding to a 2% margin of accuracy deviation. ...For the dropout rate in the baseline, we use p = 0.1 following the recommendation in Hsu et al. (2024).
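The reported setup can be summarized as a small configuration helper. This is a sketch under assumptions: the paper gives batch size 2 "for smaller training sizes" and 8 "for larger sizes" without stating the cutoff, so the threshold of 128 shots below is a guess, and the function name `training_config` is hypothetical.

```python
def training_config(num_shots: int, small_threshold: int = 128) -> dict:
    """Reconstruct the reported fine-tuning hyperparameters:
    - batch size 2 for smaller training sizes, 8 for larger ones
      (the cutoff `small_threshold` is an assumption, not from the paper);
    - learning rate 0.003;
    - training steps = 20 * (number of shots / batch size).
    """
    batch_size = 2 if num_shots <= small_threshold else 8
    steps = 20 * (num_shots // batch_size)
    return {"batch_size": batch_size, "learning_rate": 0.003, "steps": steps}


print(training_config(64))   # → {'batch_size': 2, 'learning_rate': 0.003, 'steps': 640}
print(training_config(512))  # → {'batch_size': 8, 'learning_rate': 0.003, 'steps': 1280}
```

Scaling steps with the shot count, as the formula does, keeps the number of epochs roughly constant across the 64-, 128-, and 512-shot settings.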