On the Utility of Existing Fine-Tuned Models on Data-Scarce Domains

Authors: Md Ibrahim Ibne Alam, Parikshit Ram, Soham Dan, Horst Samulowitz, Koushik Kar

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we explore different techniques for utilizing these existing DAFT models on data-scarce problems, i.e., tasks for which target-domain data is limited or unavailable. We observe that for zero-shot problems, ensembling of DAFT models provides accuracy close to that of the single best model. With few-shot problems (a few samples from the target domain available), this performance can be improved further by selecting, or assigning more weight to, the DAFT models that are expected to perform better on the target task.
Researcher Affiliation | Collaboration | Md Ibrahim Ibne Alam, Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute; Parikshit Ram, IBM; Soham Dan, Microsoft; Horst Samulowitz, IBM; Koushik Kar, Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute
Pseudocode | No | The paper describes methods (DAFT-EZ, DAFT-E) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to using existing fine-tuned models from Hugging Face (HF, 2024c; Kim, 2023) and to creating its own DAFT models by fine-tuning base models. No explicit statement or link is provided for open-sourcing the code developed for this paper's methodology.
Open Datasets | Yes | The datasets for Sentiment Analysis are: Amazon Polarity, Cornell Movie, IMDB, SST2, Tweet Sentiment, and Yelp Polarity; and for Textual Similarity: MRPC, QQP, STS-B (details in Appendix A.1). ... The direct links to download these datasets are: https://huggingface.co/datasets/amazon_polarity, https://www.cs.cornell.edu/people/pabo/movie-review-data/, https://huggingface.co/datasets/imdb, https://huggingface.co/datasets/sst2, https://huggingface.co/datasets/mteb/tweet_sentiment_extraction, https://huggingface.co/datasets/yelp_polarity, https://huggingface.co/datasets/nyu-mll/glue
Dataset Splits | Yes | Let us denote the train and test splits of the target dataset as D_T^train and D_T^test, respectively. ... In our experiments with few-shot fine-tuning, we vary n in the range of 2–256 samples. ... For DAFT-E, each dataset was split in half: one half was used to tune the linear weights of DAFT-E, and the other half was used for performance evaluation. ... Due to the small size of each dataset, we repeated the random split 200 times for each of the 9 datasets and report the average performance.
Hardware Specification | Yes | To fine-tune and train these models we used the Google Colab platform with a T4 GPU-equipped machine.
Software Dependencies | No | The paper mentions using 'SGDRegressor from sklearn.linear_model' and 'RandomForestClassifier from sklearn.ensemble', as well as Hugging Face's 'AutoTokenizer.from_pretrained', but does not provide version numbers for these software libraries or packages.
Experiment Setup | Yes | In our experiments with few-shot fine-tuning, we vary n in the range of 2–256 samples. For the (FFT) models, we fine-tune until loss stabilization. ... For LR we used SGDRegressor from sklearn.linear_model with a maximum iteration of 3. ... We also used coefficient initialization = 1/N, where N is the number of DAFT models used... For RF we imported the RandomForestClassifier from sklearn.ensemble and set max depth = 2. ... For both FT and DA(FT)2 on few-shot training, we performed all the runs five times with five different seeds. For DAFT-E, the weight calculations were also done using five different random seeds.
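The weighting setup quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code: the synthetic per-model predictions, labels, and variable names are assumptions, while the quoted hyperparameters (SGDRegressor with max_iter=3 and coefficients initialized to 1/N, RandomForestClassifier with max_depth=2) come from the paper.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestClassifier

# Hypothetical ensemble inputs: preds[i, j] is DAFT model j's positive-class
# probability on few-shot example i (synthetic data for illustration only).
rng = np.random.default_rng(0)
n_samples, n_models = 64, 5
preds = rng.uniform(size=(n_samples, n_models))
labels = (preds.mean(axis=1) > 0.5).astype(int)  # synthetic binary labels

# Linear weighting of DAFT model outputs: SGDRegressor with maximum
# iteration 3 and coefficients initialized to 1/N, per the quoted setup.
lr = SGDRegressor(max_iter=3, random_state=0)
lr.fit(preds, labels, coef_init=np.full(n_models, 1.0 / n_models))
ensemble_scores = preds @ lr.coef_ + lr.intercept_
lr_preds = (ensemble_scores > 0.5).astype(int)

# Random-forest alternative with max depth 2, per the quoted setup.
rf = RandomForestClassifier(max_depth=2, random_state=0)
rf.fit(preds, labels)
rf_preds = rf.predict(preds)
```

Averaging over several random seeds, as the paper reports doing, would wrap the fitting steps above in a loop over `random_state` values.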