FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain

Authors: Rohan Deb, Kiran Koshy Thekumparampil, Kousha Kalantari, Gaurush Hiranandani, Shoham Sabach, Branislav Kveton

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate FisherSFT empirically in Section 5. Our experiments on synthetic problems show that FisherSFT yields a lower prediction error than the baselines. We also fine-tune GPT-2 models and evaluate them with an LLM-as-a-judge (Zheng et al., 2023). The judge prefers the text generated by FisherSFT models by a large margin.
Researcher Affiliation | Collaboration | 1 University of Illinois, Urbana-Champaign (the work was done during an internship at Amazon); 2 Amazon; 3 Typeface; 4 Technion; 5 Adobe Research.
Pseudocode | Yes | Algorithm 1: Greedy optimal design for language models. Algorithm 2: FisherSFT, a fast implementation of Algorithm 1.
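The greedy optimal design named above can be illustrated with a short sketch. This is a simplified, hypothetical reconstruction, not the paper's Algorithm 1: it greedily picks examples that maximize the log-determinant of an accumulated information matrix (classical D-optimal design), omitting the softmax covariance term that the true Fisher information of a multinomial logistic model would include.

```python
import numpy as np

def greedy_optimal_design(X, n_select, reg=1e-3):
    """Greedily select rows of X maximizing log det of the regularized
    information matrix A = reg*I + sum of x x^T over selected rows.

    Simplified linear-design sketch (assumption): the marginal log-det gain
    of a candidate x is log(1 + x^T A^{-1} x), so we pick the argmax of
    x^T A^{-1} x and update A^{-1} with a Sherman-Morrison rank-1 step.
    """
    d = X.shape[1]
    A_inv = np.eye(d) / reg            # inverse of A, kept up to date
    selected = []
    remaining = set(range(len(X)))
    for _ in range(n_select):
        # marginal gain of candidate i is x_i^T A^{-1} x_i
        gains = {i: X[i] @ A_inv @ X[i] for i in remaining}
        best = max(gains, key=gains.get)
        x = X[best]
        # Sherman-Morrison update: A_inv -= (A_inv x)(A_inv x)^T / (1 + x^T A_inv x)
        Ax = A_inv @ x
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Keeping the running inverse via Sherman-Morrison avoids a full matrix inversion at every step, which is the kind of speedup a "fast implementation" variant would target.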
Open Source Code | Yes | The paper states: "Our implementation is available at github."
Open Datasets | Yes | We consider two corpora: the tiny Shakespeare corpus (Karpathy, 2015) and the Sherlock Holmes corpus (Doyle). We subsample 10 000 sentences from both corpora and experiment with learning from n ∈ [100, 5 000] sentences. We use pre-trained word2vec embeddings (Mikolov et al., 2013) of dimension 300. Sherlock Holmes collection: https://www.kaggle.com/datasets/bharatkumar0925/sherlock-holmes-collection.
Dataset Splits | No | The paper states: "We subsample 10 000 sentences from both corpora and experiment with learning from n ∈ [100, 5 000] sentences." and "All methods choose n sentences and learn a multinomial logistic regression model by solving (5)." However, it does not provide explicit training, validation, and test splits (e.g., percentages or counts) for reproducible evaluation; the GPT-2 evaluation uses a generative comparison rather than a traditional held-out split.
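The multinomial logistic regression step referenced above ("solving (5)") can be sketched as follows. This is an illustrative stand-in under stated assumptions: the paper's exact objective, regularizer, and solver are not reproduced here, and the gradient-descent softmax regression below is a generic substitute.

```python
import numpy as np

def fit_multinomial_logreg(X, y, num_classes, lr=0.5, epochs=200, reg=1e-4):
    """Fit multinomial (softmax) logistic regression by gradient descent
    on the L2-regularized cross-entropy loss.

    Hypothetical sketch: hyperparameters (lr, epochs, reg) are illustrative,
    not taken from the paper.
    """
    n, d = X.shape
    W = np.zeros((d, num_classes))
    Y = np.eye(num_classes)[y]                        # one-hot labels
    for _ in range(epochs):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
        grad = X.T @ (P - Y) / n + reg * W            # cross-entropy gradient
        W -= lr * grad
    return W
```

In the paper's setting, the rows of X would be the 300-dimensional word2vec sentence embeddings of the n selected sentences.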
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies | No | The paper mentions using "GPT-2 models (Radford et al., 2019)" and fine-tuning with "Hugging Face (Wolf et al., 2020)", but it does not specify version numbers for these or any other software components (e.g., Python, PyTorch, or individual library versions).
Experiment Setup | No | The paper describes experiments that fine-tune GPT-2 models and train a multinomial logistic regression model, but it does not report concrete setup details such as learning rates, batch sizes, number of epochs, optimizers, or other hyperparameters needed to replicate training.