k-NN as a Simple and Effective Estimator of Transferability

Authors: Moein Sorkhei, Christos Matsoukas, Johan Fredin Haslum, Emir Konuk, Kevin Smith

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted an extensive evaluation involving over 42,000 experiments comparing 23 transferability metrics across 16 different datasets to assess their ability to predict transfer performance for image classification tasks.
Researcher Affiliation | Academia | Moein Sorkhei, KTH Royal Institute of Technology, Stockholm, Sweden; Science for Life Laboratory, Stockholm, Sweden
Pseudocode | No | The paper describes mathematical formulations for various metrics (NCE, LEEP, N-LEEP, GBC, FID, EMD, IDS) but does not include any structured pseudocode or algorithm blocks.
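For illustration, the mathematical formulations of these metrics are typically short enough to transcribe directly into code. Below is a minimal pure-Python sketch of LEEP (Nguyen et al., 2020), one of the metrics named above, assuming the source model's softmax outputs and the target labels are already available; the function and argument names are our own, not the paper's.

```python
import math

def leep(source_probs, target_labels):
    """LEEP score: average log-likelihood of the target labels under an
    empirical conditional distribution built from the source model's
    soft predictions. Pure-Python sketch, not the authors' code.

    source_probs:  list of softmax vectors over the source classes
    target_labels: list of target-class indices, one per sample
    """
    n = len(source_probs)
    num_z = len(source_probs[0])
    classes = sorted(set(target_labels))

    # Empirical joint distribution P(y, z) over target labels y
    # and source classes z.
    joint = {y: [0.0] * num_z for y in classes}
    for probs, y in zip(source_probs, target_labels):
        for z, p in enumerate(probs):
            joint[y][z] += p / n

    # Marginal P(z) and conditional P(y | z).
    pz = [sum(joint[y][z] for y in classes) for z in range(num_z)]
    cond = {y: [joint[y][z] / pz[z] for z in range(num_z)] for y in classes}

    # Average log-likelihood of the target labels (higher is better).
    return sum(
        math.log(sum(cond[y][z] * probs[z] for z in range(num_z)))
        for probs, y in zip(source_probs, target_labels)
    ) / n
```

When the source predictions perfectly separate the target classes, each per-sample likelihood is 1 and the score reaches its maximum of 0; less informative source models give more negative scores.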
Open Source Code | No | The paper mentions using PyTorch for training models but does not provide a link to their own source code, nor does it explicitly state that their code will be released or is available in supplementary materials.
Open Datasets | Yes | We apply transfer learning across a diverse set of 16 image classification datasets. For the source domains, we selected ImageNet (Deng et al., 2009), iNat2017 (Van Horn et al., 2018), Places365 (Zhou et al., 2017), and NABirds (Van Horn et al., 2015). As target datasets, we include well-known benchmarks such as CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), Caltech-101 (Fei-Fei et al., 2004), Caltech-256 (Griffin et al., 2007), Stanford Dogs (Khosla et al., 2011), Aircraft (Maji et al., 2013), NABirds (Van Horn et al., 2015), Oxford-IIIT Pet (Parkhi et al., 2012), SUN397 (Xiao et al., 2010), DTD (Cimpoi et al., 2014), AID (Xia et al., 2017), and APTOS2019 (Karthik, 2019).
Dataset Splits | Yes | For each dataset, either the official train/val/test splits were used, or we made the splits following Kornblith et al. (2019). ... Specifically, we split the training set (S) of the target into two disjoint subsets S1 and S2, comprising 80% and 20% of the training set. Subsequently, k-NN classification was performed on S2 using the k nearest neighbors from S1. The resulting k-NN accuracy served as the transferability score (to ensure reliability, we repeated the same procedure with 3-fold cross-validation on the training set, yielding identical results).
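The 80/20 split-and-classify procedure quoted above can be sketched in a few lines of pure Python, assuming features have already been extracted from the pretrained source model; the function name, brute-force distance search, and fixed seed are our own illustrative choices, not the authors' implementation.

```python
import random
from collections import Counter

def knn_transferability(features, labels, k=5, split=0.8, seed=0):
    """Estimate transferability as k-NN accuracy on the target's training set.

    features: list of feature vectors extracted from the pretrained model
    labels:   corresponding target-class labels
    The set is split into disjoint subsets S1 (80%) and S2 (20%), and S2 is
    classified by majority vote over its k nearest neighbors in S1.
    """
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)
    cut = int(split * len(idx))
    s1, s2 = idx[:cut], idx[cut:]

    def dist(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    correct = 0
    for i in s2:
        # k nearest neighbors of sample i among S1, then majority vote
        nearest = sorted(s1, key=lambda j: dist(features[i], features[j]))[:k]
        votes = Counter(labels[j] for j in nearest)
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(s2)
```

In practice one would use an optimized neighbor search (e.g. a k-d tree or a vectorized distance matrix) rather than this O(|S1|·|S2|) loop, but the estimator itself is exactly this simple: a single feature extraction pass followed by nearest-neighbor voting, with no training.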
Hardware Specification | No | We acknowledge the Berzelius computational resources provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre and the computational resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725. This mentions general computational resources and centers but lacks specific hardware details such as GPU/CPU models or memory specifications.
Software Dependencies | Yes | The Adam optimizer (Kingma & Ba, 2014) was used for CNNs and AdamW (Loshchilov & Hutter, 2017) for ViT-based architectures, and the training of models was done using PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | Images were normalized and resized to 256×256, after which augmentations were applied: random color jittering, random horizontal flip, and random cropping of the rescaled image to 224×224. The Adam optimizer (Kingma & Ba, 2014) was used for CNNs and AdamW (Loshchilov & Hutter, 2017) for ViT-based architectures... After a grid search, the pretrained and the randomly-initialized models were trained with learning rates of 10⁻⁴ and 3×10⁻⁴ respectively, following an initial warm-up for 1,000 iterations. During training, the learning rate was dropped by a factor of 10 whenever training saturated, until it reached a final learning rate of 10⁻⁶ or 3×10⁻⁶ for pretrained or randomly-initialized models respectively. The checkpoint with the highest validation performance was chosen for final evaluation.
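The drop-by-10-on-saturation schedule described in this row can be sketched as a small framework-free helper. The paper's exact saturation criterion is not quoted here, so this sketch assumes one plausible reading, "no improvement in training loss for `patience` consecutive steps"; the class name and all parameters are illustrative, not the authors'.

```python
class PlateauLR:
    """Drop the learning rate by a factor of 10 whenever training
    saturates, stopping at a floor (e.g. lr=1e-4 down to floor=1e-6
    for pretrained models, per the setup above). Sketch only:
    'saturation' is approximated by a patience window on the loss.
    """

    def __init__(self, lr=1e-4, floor=1e-6, factor=0.1, patience=3):
        self.lr, self.floor, self.factor, self.patience = lr, floor, factor, patience
        self.best = float("inf")   # best loss seen so far
        self.bad_steps = 0         # steps without improvement

    def step(self, loss):
        """Record one training-loss observation; return the current lr."""
        if loss < self.best:
            self.best, self.bad_steps = loss, 0
        else:
            self.bad_steps += 1
            if self.bad_steps >= self.patience and self.lr > self.floor:
                self.lr = max(self.lr * self.factor, self.floor)
                self.bad_steps = 0
        return self.lr
```

In a real PyTorch training loop the same behavior is usually obtained with the built-in `torch.optim.lr_scheduler.ReduceLROnPlateau` rather than a hand-rolled class.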