Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks

Authors: Edan Kinderman, Itay Hubara, Haggai Maron, Daniel Soudry

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | It achieves SOTA results when tested on MLPs and transformers across various sizes, tasks, modalities, and distribution shifts, especially in low-data scenarios. ... Table 1: Merging a pair of ViT-B-16 models, fine-tuned on Cars and CIFAR10, using 100 original images and 800 augmented images from each dataset. The test accuracy is averaged over both tasks.
Researcher Affiliation | Collaboration | Edan Kinderman (EMAIL), Electrical and Computer Engineering Department, Technion; Itay Hubara (EMAIL), Habana Labs, an Intel company; Haggai Maron (EMAIL), Electrical and Computer Engineering Department, Technion, and NVIDIA Research; Daniel Soudry (EMAIL), Electrical and Computer Engineering Department, Technion
Pseudocode | No | The paper describes methods like FS-Merge and folding operations using mathematical equations and prose (e.g., in Sections 2.1, 2.2, B.1, B.2, and B.3), but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step procedures.
Open Source Code | Yes | Code is available at https://github.com/idankinderman/fs_merge.
Open Datasets | Yes | fine-tuned each on separate tasks (Cars (Krause et al., 2013) and CIFAR10 (Krizhevsky et al., 2009)) ... MNIST (LeCun, 1998) ... GLUE dataset (Wang et al., 2019) ... ImageNet-1k (Deng et al., 2009)
Dataset Splits | Yes | MNIST (LeCun, 1998) was split into two subsets: images with labels 0–4 and images with labels 5–9. These subsets were further divided into training, validation, and test sets. If there wasn't a validation set, one was created by using 15% of the training set. ... Table 3: Merging pairs of ViT-B-16 models using 16 original images from each training set and 800 augmented images from each dataset. ... Table 4: Merging groups of 4 ViT-B-16 models using 100 original and 1,000 augmented images per training dataset (a total of 400 original images and 4,000 augmented images).
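The split procedure quoted above (partition MNIST by label into two tasks, then hold out 15% of each training set as validation when none exists) can be sketched as follows. This is an illustrative reconstruction, not code from the paper; the function names and the use of a fixed shuffle seed are assumptions.

```python
import random

def split_by_label(dataset, low_labels=frozenset({0, 1, 2, 3, 4})):
    """Partition (image, label) pairs into the two task subsets
    described in the paper: labels 0-4 vs. labels 5-9."""
    task_a = [(x, y) for x, y in dataset if y in low_labels]
    task_b = [(x, y) for x, y in dataset if y not in low_labels]
    return task_a, task_b

def carve_validation(train_set, frac=0.15, seed=0):
    """Hold out `frac` (15% per the paper) of the training set as a
    validation set. Shuffling with a fixed seed is an assumption."""
    rng = random.Random(seed)
    idx = list(range(len(train_set)))
    rng.shuffle(idx)
    n_val = int(len(idx) * frac)
    val = [train_set[i] for i in idx[:n_val]]
    train = [train_set[i] for i in idx[n_val:]]
    return train, val
```

The same two helpers apply to any labeled dataset, so the CIFAR10/Cars splits quoted in the tables could be handled identically.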
Hardware Specification | No | The paper discusses 'resource-intensive techniques' and 'memory and compute resource demands' but does not specify any particular hardware components, such as CPU/GPU models, processor types, or memory amounts, used for the experiments.
Software Dependencies | No | 'We used GD optimizer in FS-Merge and Distillation ... an ADAMW optimizer with a weight decay of 0.001.' The paper mentions specific optimizers and concepts like 'cross-entropy loss' and 'cosine scheduler' but does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | All models were fine-tuned with a batch size of 256, a learning rate of 1e-5, cross-entropy loss, and a cosine scheduler using a single cycle with a warm-up phase. ... For all methods that require training (FS-Merge and distillation), a batch size of 128 was used, along with a cosine scheduler using a single cycle with a warm-up phase, an AdamW optimizer with a weight decay of 0.001, and initialization from the first model ('First').
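The quoted schedule (single-cycle cosine with warm-up, peaking at the fine-tuning learning rate of 1e-5) can be sketched as a plain learning-rate function. The linear shape of the warm-up and the decay-to-zero floor are assumptions; the paper does not specify them.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Single-cycle cosine schedule with linear warm-up.

    Rises linearly to `peak_lr` over `warmup_steps`, then follows
    half a cosine cycle down to zero at `total_steps`. Assumed
    details: linear warm-up, zero final learning rate.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch setup this curve would typically be wired in via `torch.optim.lr_scheduler.LambdaLR` on top of `torch.optim.AdamW` with `weight_decay=0.001`, matching the optimizer quoted above.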