FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain

Authors: Rohan Deb, Kiran Koshy Thekumparampil, Kousha Kalantari, Gaurush Hiranandani, Shoham Sabach, Branislav Kveton

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate FisherSFT empirically in Section 5. Our experiments on synthetic problems show that FisherSFT yields a lower prediction error than the baselines. We also fine-tune GPT-2 models and evaluate them with an LLM-as-a-judge (Zheng et al., 2023). The judge prefers the text generated by FisherSFT models by a large margin.
Researcher Affiliation | Collaboration | 1 University of Illinois, Urbana-Champaign (the work was done during an internship at Amazon); 2 Amazon; 3 Typeface; 4 Technion; 5 Adobe Research.
Pseudocode | Yes | Algorithm 1: Greedy optimal design for language models. Algorithm 2: FisherSFT, a fast implementation of Algorithm 1.
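The greedy optimal design named above can be illustrated with a short sketch. This is a simplified, hypothetical reconstruction, not the paper's Algorithm 1: it greedily picks examples that maximize the log-determinant of an accumulated information matrix (classical D-optimal design), omitting the softmax covariance term that the true Fisher information of a multinomial logistic model would include.

```python
import numpy as np

def greedy_optimal_design(X, n_select, reg=1e-3):
    """Greedily select rows of X maximizing log det of the regularized
    information matrix A = reg*I + sum of x x^T over selected rows.

    Simplified linear-design sketch (assumption): the marginal log-det gain
    of a candidate x is log(1 + x^T A^{-1} x), so we pick the argmax of
    x^T A^{-1} x and update A^{-1} with a Sherman-Morrison rank-1 step.
    """
    d = X.shape[1]
    A_inv = np.eye(d) / reg            # inverse of A, kept up to date
    selected = []
    remaining = set(range(len(X)))
    for _ in range(n_select):
        # marginal gain of candidate i is x_i^T A^{-1} x_i
        gains = {i: X[i] @ A_inv @ X[i] for i in remaining}
        best = max(gains, key=gains.get)
        x = X[best]
        # Sherman-Morrison update: A_inv -= (A_inv x)(A_inv x)^T / (1 + x^T A_inv x)
        Ax = A_inv @ x
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Keeping the running inverse via Sherman-Morrison avoids a full matrix inversion at every step, which is the kind of speedup a "fast implementation" variant would target.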
Open Source Code | Yes | The paper states: "Our implementation is available at github."
Open Datasets | Yes | We consider two corpora: the tiny Shakespeare corpus (Karpathy, 2015) and the Sherlock Holmes corpus (Doyle). We subsample 10 000 sentences from both corpora and experiment with learning from n ∈ [100, 5 000] sentences. We use pre-trained word2vec embeddings (Mikolov et al., 2013) of dimension 300. Sherlock Holmes collection: https://www.kaggle.com/datasets/bharatkumar0925/sherlock-holmes-collection.
Dataset Splits | No | The paper states: "We subsample 10 000 sentences from both corpora and experiment with learning from n ∈ [100, 5 000] sentences." and "All methods choose n sentences and learn a multinomial logistic regression model by solving (5)." However, it does not provide explicit training, validation, and test splits (e.g., percentages or counts) for reproducible evaluation; the GPT-2 evaluation uses a generative comparison rather than a traditional held-out split.
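The multinomial logistic regression step referenced above ("solving (5)") can be sketched as follows. This is an illustrative stand-in under stated assumptions: the paper's exact objective, regularizer, and solver are not reproduced here, and the gradient-descent softmax regression below is a generic substitute.

```python
import numpy as np

def fit_multinomial_logreg(X, y, num_classes, lr=0.5, epochs=200, reg=1e-4):
    """Fit multinomial (softmax) logistic regression by gradient descent
    on the L2-regularized cross-entropy loss.

    Hypothetical sketch: hyperparameters (lr, epochs, reg) are illustrative,
    not taken from the paper.
    """
    n, d = X.shape
    W = np.zeros((d, num_classes))
    Y = np.eye(num_classes)[y]                        # one-hot labels
    for _ in range(epochs):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        grad = X.T @ (P - Y) / n + reg * W            # cross-entropy gradient
        W -= lr * grad
    return W
```

In the paper's setting, the rows of X would be the 300-dimensional word2vec sentence embeddings of the n selected sentences.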
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies | No | The paper mentions using "GPT-2 models (Radford et al., 2019)" and fine-tuning with "Hugging Face (Wolf et al., 2020)", but it does not specify version numbers for these or any other software components (e.g., Python, PyTorch, or individual library versions).
Experiment Setup | No | The paper describes experiments that fine-tune GPT-2 models and train a multinomial logistic regression model, but it does not report concrete setup details such as learning rates, batch sizes, number of epochs, optimizers, or other hyperparameters needed to replicate training.