Unsupervised Model Tree Heritage Recovery

Authors: Eliahu Horwitz, Asaf Shul, Yedid Hoshen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In extensive experiments we demonstrate that our method successfully reconstructs complex Model Trees. To evaluate the MoTHer Recovery task, we introduce the MoTHer dataset, a Model Graph comprising over 500 models from diverse architectures and modalities. We use accuracy as the evaluation metric; a correct prediction is one where both the edge placement and direction are correct. In all our tests, MoTHer ran in seconds to minutes even on a CPU (see App. F for more details).
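The accuracy criterion quoted above (an edge counts as correct only if both its placement and its direction match the ground truth) can be sketched as follows; the function name and edge-list format are illustrative, not from the paper:

```python
def edge_accuracy(pred_edges, true_edges):
    """Fraction of ground-truth parent->child edges recovered with the
    correct direction; a flipped edge counts as incorrect."""
    pred, true = set(pred_edges), set(true_edges)
    return len(pred & true) / len(true)

# Ground truth: root -> a, a -> b. The prediction recovers the first
# edge but flips the second, so accuracy is 1/2.
print(edge_accuracy([("root", "a"), ("b", "a")],
                    [("root", "a"), ("a", "b")]))  # 0.5
```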
Researcher Affiliation | Academia | Eliahu Horwitz, Asaf Shul, Yedid Hoshen, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Pseudocode | Yes | We can recover the Model Tree from M using a minimum directed spanning tree (MDST) algorithm. In this paper, we employ the Chu-Liu-Edmonds algorithm (Chu, 1965; Edmonds, 1967), which iteratively contracts cycles in the graph until a tree is formed. The algorithm proceeds as follows: initially, it treats each node as a temporary tree. It then merges the temporary trees via the incoming edge with the minimum weight. Subsequently, it identifies cycles in the remaining temporary trees and removes the edge with the highest weight. This merging process continues until all cycles are eliminated, yielding the minimum directed spanning tree.
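The contraction procedure described above can be sketched with a generic textbook implementation of Chu-Liu-Edmonds; this is not the authors' code, and the toy pairwise-distance weights at the bottom are hypothetical:

```python
def min_arborescence(nodes, edges, root):
    """Chu-Liu-Edmonds. `edges` maps a directed edge (u, v) to its weight;
    returns the minimum-weight arborescence rooted at `root` as a dict."""
    # Step 1: for every non-root node, keep the cheapest incoming edge.
    min_in = {}
    for (u, v), w in edges.items():
        if v != root and (v not in min_in or w < edges[(min_in[v], v)]):
            min_in[v] = u

    # Step 2: look for a cycle among the chosen edges.
    cycle = None
    for start in nodes:
        if start == root:
            continue
        seen, v = [], start
        while v != root and v not in seen:
            seen.append(v)
            if v not in min_in:
                break
            v = min_in[v]
        else:
            if v != root:  # v reappeared -> cycle found
                cycle = seen[seen.index(v):]
                break
    if cycle is None:  # the chosen edges already form an arborescence
        return {(min_in[v], v): edges[(min_in[v], v)] for v in min_in}

    # Step 3: contract the cycle into a super-node and recurse.
    super_node, cyc = object(), set(cycle)
    new_edges, origin = {}, {}
    for (u, v), w in edges.items():
        nu = super_node if u in cyc else u
        nv = super_node if v in cyc else v
        if nu == nv:
            continue  # edge internal to the cycle
        # Edges entering the cycle are discounted by the cycle edge replaced.
        nw = w - edges[(min_in[v], v)] if v in cyc else w
        if (nu, nv) not in new_edges or nw < new_edges[(nu, nv)]:
            new_edges[(nu, nv)] = nw
            origin[(nu, nv)] = (u, v)
    sub_nodes = [n for n in nodes if n not in cyc] + [super_node]
    sub = min_arborescence(sub_nodes, new_edges, root)

    # Step 4: expand, keeping every cycle edge except the one whose head
    # is now reached from outside the cycle.
    result, entered = {}, None
    for nu, nv in sub:
        u, v = origin[(nu, nv)]
        result[(u, v)] = edges[(u, v)]
        if nv is super_node:
            entered = v
    for v in cycle:
        if v != entered:
            result[(min_in[v], v)] = edges[(min_in[v], v)]
    return result


# Hypothetical distance weights between a root model and three descendants;
# the cheapest edges into a and b form a cycle that must be contracted.
nodes = ["root", "a", "b", "c"]
edges = {("root", "a"): 10, ("root", "b"): 10, ("a", "b"): 1,
         ("b", "a"): 1, ("a", "c"): 1, ("b", "c"): 8}
arb = min_arborescence(nodes, edges, "root")
print(sum(arb.values()))  # 12
```

Each non-root node ends up with exactly one incoming edge, as required of a Model Tree.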
Open Source Code | Yes | We include our code in the supplementary material and will make it publicly available through GitHub upon acceptance.
Open Datasets | Yes | To evaluate the MoTHer Recovery task, we introduce the MoTHer dataset, a Model Graph comprising over 500 models from diverse architectures and modalities. ... We will also upload the entire MoTHer dataset to Hugging Face. ... Each category contains 105 models in 3 levels of hierarchy and is comprised of 5 Model Trees rooted by different, unrelated pre-trained ViTs (Dosovitskiy et al., 2020) found on Hugging Face. The second level of each Model Tree contains 4 models fine-tuned on randomly chosen datasets from the VTAB benchmark (Zhai et al., 2019).
Dataset Splits | No | The paper describes the construction of the synthetic MoTHer dataset, including the number of models, levels of hierarchy, and fine-tuning procedures, and mentions using datasets from the VTAB benchmark. However, it provides no training, validation, or test splits for evaluating the MoTHer Recovery method itself, nor does it specify how the VTAB datasets were split within the experiments.
Hardware Specification | No | In all our tests, MoTHer ran in seconds to minutes even on a CPU (see App. F for more details). ... Pairwise distances (CPU) 0.033 sec ... Pairwise distances (GPU) 0.1 sec.
Software Dependencies | No | For clustering the Model Graph into different Model Trees, we use hierarchical clustering over the ℓ2 pairwise distance and assume knowledge of the number of clusters. We use the scipy (Virtanen et al., 2020) implementation with the default hyperparameters. ... Specifically, we fine-tuned a new ViT Model Graph with a structure similar to the FT graph (5 Model Trees, each containing 21 models). We used this Model Graph and incrementally pruned weights from the models using the l1_unstructured function in torch.nn.utils.prune... We tested 2 quantization methods: i) simple quantization to fp16, ii) int8 quantization using bitsandbytes.
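The clustering step quoted above can be sketched with scipy's hierarchical-clustering API; the toy vectors below stand in for flattened model weights and are purely illustrative, as is the choice of two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical flattened model weights: two well-separated groups,
# i.e. models descending from two unrelated pre-trained roots.
weights = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [5.0, 5.1, 5.0],
    [5.1, 5.0, 5.1],
])

dists = pdist(weights, metric="euclidean")       # l2 pairwise distances
Z = linkage(dists)                               # default scipy hyperparameters
labels = fcluster(Z, t=2, criterion="maxclust")  # number of Model Trees assumed known
print(labels)  # first two models share one label, last two the other
```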
Experiment Setup | Yes | For all the MoTHer dataset subsets, we use the following models as the Model Tree roots taken from Hugging Face: ... For the FT split, to prevent model overfitting, we use larger datasets of 10K samples rather than the original 1K used in the VTAB benchmark. Each model uses a different randomly sampled seed. See Tab. 2 for additional hyperparameters. ... Table 2: Full fine-tuning hyperparameters (lr [6e-3, 9e-3, 2e-4, 5e-4], batch_size 64, epochs [2, 5], datasets cifar100, svhn, patch_camelyon, clevr-count, clevr-distance, dmlab) ... Table 3: LoRA varying-ranks fine-tuning hyperparameters (lora_rank (r) 8, 16, 32, 64, lora_alpha (α) 8, 16, 32, 64, lr [6e-3, 9e-3, 2e-4, 5e-4], batch_size 128, epochs [10, 20], datasets cifar100, caltech101, dtd, flower102, pet37, svhn, patch_camelyon, clevr-count, clevr-distance, dmlab, kitti, dsprites-location, dsprites-orientation, smallnorb-azimuth, smallnorb-elevation)
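As a sketch, the full fine-tuning grid of Table 2 can be written down as a plain dictionary; whether the paper sweeps the full cross-product or samples one configuration per model is not specified, so the enumeration below is illustrative only:

```python
from itertools import product

# Hyperparameter options from Table 2 (full fine-tuning); the sweep
# strategy (full grid vs. per-model sampling) is an assumption here.
full_ft_grid = {
    "lr": [6e-3, 9e-3, 2e-4, 5e-4],
    "batch_size": [64],
    "epochs": [2, 5],
    "dataset": ["cifar100", "svhn", "patch_camelyon",
                "clevr-count", "clevr-distance", "dmlab"],
}

configs = [dict(zip(full_ft_grid, vals))
           for vals in product(*full_ft_grid.values())]
print(len(configs))  # 4 lrs x 1 batch size x 2 epoch counts x 6 datasets = 48
```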