Pruning Feature Extractor Stacking for Cross-domain Few-shot Learning
Authors: Hongyu Wang, Eibe Frank, Bernhard Pfahringer, Geoff Holmes
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform evaluation on Meta-Dataset with the extended set of target domains and adhere to the official sampling method to generate 600 few-shot episodes for each domain: each episode contains 5 to 50 classes, up to 100 support set instances per class, up to 500 (potentially class-imbalanced) support set instances in total, and 10 query set instances per class. The sampled episodes are cached to ensure that all pruning strategies are evaluated using the same episodes. We use FES with TSA fine-tuning to evaluate pruning strategies. Table 1 compares BSS to the baselines. |
| Researcher Affiliation | Academia | All four authors (Hongyu Wang, Eibe Frank, Bernhard Pfahringer, Geoffrey Holmes) are affiliated with the Department of Computer Science, University of Waikato. |
| Pseudocode | Yes | Algorithm 1 Bidirectional snapshot selection |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository or an explicit statement about code release for the described methodology, nor does it indicate code availability in supplementary materials. The OpenReview link is for the review process, not code. |
| Open Datasets | Yes | We perform evaluation on Meta-Dataset with the extended set of target domains... Meta-Dataset (Triantafillou et al., 2020) is a benchmark for evaluating CDFSL methods, originally containing eight domains (ilsvrc_2012, omniglot, aircraft, cu_birds, dtd, quickdraw, fungi, and vgg_flower) and two target domains from which episodes can be sampled (traffic_sign and mscoco). Three additional target domains, mnist, cifar10, and cifar100, were added by Requeima et al. (2019), and a further five, namely Crop Disease, Euro SAT, ISIC, Chest X, and Food101, were added by Wang et al. (2024). |
| Dataset Splits | Yes | We perform evaluation on Meta-Dataset with the extended set of target domains and adhere to the official sampling method to generate 600 few-shot episodes for each domain: each episode contains 5 to 50 classes, up to 100 support set instances per class, up to 500 (potentially class-imbalanced) support set instances in total, and 10 query set instances per class. BSS leverages the 2-fold stratified cross-validation performed by FES to guide the search: the support set is split into two stratified partitions of instances, which alternate between serving for extractor fine-tuning and for logit extraction. |
| Hardware Specification | Yes | Fine-tuning and stacking are performed on an NVIDIA A6000 GPU, while pruning is performed on an Intel Core i7-6700K CPU. |
| Software Dependencies | No | The paper mentions ResNet18 as an extractor architecture and TSA for fine-tuning but does not provide specific version numbers for software libraries or dependencies like PyTorch, TensorFlow, Python, or CUDA. |
| Experiment Setup | Yes | Each of the eight source domain extractors is fine-tuned with TSA for 40 iterations, leading to 8 × 41 = 328 snapshots in total (41 per extractor: the initial snapshot plus one per iteration). We also evaluate semi-supervised FES with TSA fine-tuning and STC, using the hyperparameters from Wang et al. (2023). We evaluate BSS with patience = 0 as our main method. In an ablation study, we evaluate two unidirectional snapshot selection strategies... We also evaluate BSS, UFSS, and UBSS with patience values from 0, i.e., no patience, to 328, which ensures depletion of the candidate pool. We perform paired t-tests with p = 0.05 using the accuracy of individual episodes to determine the statistical significance of differences in accuracy on individual datasets. Semi-supervised FES with STC uses 1000 unlabelled instances. |
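The episode-sampling constraints quoted in the table (5 to 50 classes, at most 100 support instances per class, at most 500 support instances in total, 10 query instances per class) can be sketched as follows. This is a hypothetical illustration of the constraints, not the official Meta-Dataset sampler; `instances_per_class` and `sample_episode` are assumed names.

```python
import random

def sample_episode(instances_per_class, rng=None):
    """Sketch of Meta-Dataset-style episode sampling as described above:
    5-50 classes, <=100 support instances per class, <=500 support
    instances in total, and exactly 10 query instances per class."""
    rng = rng or random.Random(0)
    classes = list(instances_per_class)
    n_way = rng.randint(5, min(50, len(classes)))
    chosen = rng.sample(classes, n_way)
    support, query, budget = {}, {}, 500  # global support-set budget
    for c in chosen:
        ids = list(range(instances_per_class[c]))
        rng.shuffle(ids)
        query[c] = ids[:10]  # fixed query size per class
        cap = min(100, len(ids) - 10, budget)  # per-class and global caps
        n_sup = rng.randint(1, cap) if cap >= 1 else 0  # class-imbalanced
        support[c] = ids[10:10 + n_sup]
        budget -= n_sup
    return support, query
```

The paper notes that sampled episodes are cached so that all pruning strategies are evaluated on identical episodes; in this sketch that would amount to fixing the `rng` seed and storing the returned index sets.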
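The 2-fold stratified cross-validation that BSS reuses from FES splits the support set into two label-stratified halves, which then alternate roles (extractor fine-tuning vs. logit extraction). A minimal sketch of such a split, assuming index-based labels; `stratified_two_fold` is a hypothetical helper, not the paper's code:

```python
import random
from collections import defaultdict

def stratified_two_fold(labels, rng=None):
    """Split support-set indices into two label-stratified partitions,
    so each class is divided as evenly as possible across the folds."""
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_a, fold_b = [], []
    for ids in by_class.values():
        rng.shuffle(ids)
        half = len(ids) // 2
        fold_a.extend(ids[:half])   # fold A gets the first half per class
        fold_b.extend(ids[half:])   # fold B gets the rest (one more if odd)
    return fold_a, fold_b
```

Each fold would then be used once for fine-tuning while logits are extracted on the other, and vice versa.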
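The significance testing described above (paired t-tests at p = 0.05 over per-episode accuracies) can be sketched as below. With 600 episodes per domain the t-distribution is close to normal, so this sketch uses a normal approximation for the two-sided p-value; it is not the authors' implementation, and an exact test would use the t-distribution's CDF.

```python
import math
from statistics import mean, stdev

def paired_ttest(acc_a, acc_b):
    """Paired t-test over per-episode accuracies of two methods.
    Returns the t statistic and a two-sided p-value computed with a
    normal approximation (reasonable for n = 600 episodes)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))  # stdev uses n - 1
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p
```

A difference would then be reported as significant when the returned p-value is below 0.05.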