Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks

Authors: Edan Kinderman, Itay Hubara, Haggai Maron, Daniel Soudry

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | It achieves SOTA results when tested on MLPs and transformers across various sizes, tasks, modalities, and distribution shifts, especially in low-data scenarios. ... Table 1: Merging a pair of ViT-B-16 models, fine-tuned on Cars and CIFAR10, using 100 original images and 800 augmented images from each dataset. The test accuracy is averaged over both tasks.
Researcher Affiliation | Collaboration | Edan Kinderman (EMAIL), Electrical and Computer Engineering Department, Technion; Itay Hubara (EMAIL), Habana Labs, an Intel company; Haggai Maron (EMAIL), Electrical and Computer Engineering Department, Technion, and NVIDIA Research; Daniel Soudry (EMAIL), Electrical and Computer Engineering Department, Technion
Pseudocode | No | The paper describes methods like FS-Merge and folding operations using mathematical equations and prose (e.g., in Sections 2.1, 2.2, B.1, B.2, and B.3), but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step procedures.
Open Source Code | Yes | Code is available at https://github.com/idankinderman/fs_merge.
Open Datasets | Yes | fine-tuned each on separate tasks (Cars (Krause et al., 2013) and CIFAR10 (Krizhevsky et al., 2009)) ... MNIST (LeCun, 1998) ... GLUE dataset (Wang et al., 2019) ... ImageNet-1k (Deng et al., 2009)
Dataset Splits | Yes | MNIST (LeCun, 1998) was split into two subsets: images with labels 0–4 and images with labels 5–9. These subsets were further divided into training, validation, and test sets. If there wasn't a validation set, one was created by using 15% of the training set. ... Table 3: Merging pairs of ViT-B-16 models using 16 original images from each training set and 800 augmented images from each dataset. ... Table 4: Merging groups of 4 ViT-B-16 models using 100 original and 1,000 augmented images per training dataset (a total of 400 original images and 4,000 augmented images).
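The split procedure quoted above (partition MNIST by label into two tasks, then hold out 15% of each training set as validation when none exists) can be sketched as follows. This is an illustrative reconstruction, not code from the paper; the function names and the use of a fixed shuffle seed are assumptions.

```python
import random

def split_by_label(dataset, low_labels=frozenset({0, 1, 2, 3, 4})):
    """Partition (image, label) pairs into the two task subsets
    described in the paper: labels 0-4 vs. labels 5-9."""
    task_a = [(x, y) for x, y in dataset if y in low_labels]
    task_b = [(x, y) for x, y in dataset if y not in low_labels]
    return task_a, task_b

def carve_validation(train_set, frac=0.15, seed=0):
    """Hold out `frac` (15% per the paper) of the training set as a
    validation set. Shuffling with a fixed seed is an assumption."""
    rng = random.Random(seed)
    idx = list(range(len(train_set)))
    rng.shuffle(idx)
    n_val = int(len(idx) * frac)
    val = [train_set[i] for i in idx[:n_val]]
    train = [train_set[i] for i in idx[n_val:]]
    return train, val
```

The same two helpers apply to any labeled dataset, so the CIFAR10/Cars splits quoted in the tables could be handled identically.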
Hardware Specification | No | The paper discusses 'resource-intensive techniques' and 'memory and compute resource demands' but does not specify any particular hardware components, such as CPU/GPU models, processor types, or memory amounts, used for the experiments.
Software Dependencies | No | 'We used GD optimizer in FS-Merge and Distillation ... an ADAMW optimizer with a weight decay of 0.001.' The paper mentions specific optimizers and concepts like 'cross-entropy loss' and 'cosine scheduler' but does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | All models were fine-tuned with a batch size of 256, a learning rate of 1e-5, cross-entropy loss, and a cosine scheduler using a single cycle with a warm-up phase. ... For all methods that require training (FS-Merge and distillation), a batch size of 128 was used, along with a cosine scheduler using a single cycle with a warm-up phase, an AdamW optimizer with a weight decay of 0.001, and initialization from the first model ('First').
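The quoted schedule (single-cycle cosine with warm-up, peaking at the fine-tuning learning rate of 1e-5) can be sketched as a plain learning-rate function. The linear shape of the warm-up and the decay-to-zero floor are assumptions; the paper does not specify them.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Single-cycle cosine schedule with linear warm-up.

    Rises linearly to `peak_lr` over `warmup_steps`, then follows
    half a cosine cycle down to zero at `total_steps`. Assumed
    details: linear warm-up, zero final learning rate.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch setup this curve would typically be wired in via `torch.optim.lr_scheduler.LambdaLR` on top of `torch.optim.AdamW` with `weight_decay=0.001`, matching the optimizer quoted above.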