ReSi: A Comprehensive Benchmark for Representational Similarity Measures
Authors: Max Klabunde, Tassilo Wald, Tobias Schumacher, Klaus Maier-Hein, Markus Strohmaier, Florian Lemmerich
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The ReSi benchmark consists of (i) six carefully designed tests for similarity measures, (ii) 24 similarity measures, (iii) 14 neural network architectures, and (iv) seven datasets, spanning the graph, language, and vision domains. The benchmark opens up several important avenues of research on representational similarity that enable novel explorations and applications of neural architectures. We demonstrate the utility of the ReSi benchmark by conducting experiments on various neural network architectures, real-world datasets, and similarity measures. All components of the benchmark are publicly available1 and thereby facilitate systematic reproduction and production of research results. |
| Researcher Affiliation | Academia | 1University of Passau 2Medical Image Computing, German Cancer Research Center (DKFZ) 3Helmholtz Imaging, DKFZ 4University of Heidelberg 5University of Mannheim 6RWTH Aachen University 7Heidelberg University Hospital 8National Center for Tumor Diseases (NCT) Heidelberg 9GESIS Leibniz Institute for the Social Sciences 10Complexity Science Hub |
| Pseudocode | No | The paper provides mathematical definitions for similarity measures but does not include any clearly labeled pseudocode or algorithm blocks with structured steps for a method or procedure. |
| Open Source Code | Yes | All components of the benchmark are publicly available1... 1https://github.com/mklabunde/resi... REPRODUCIBILITY STATEMENT All our code and data as well as instructions how to run the benchmark are publicly available at https://github.com/mklabunde/resi. |
| Open Datasets | Yes | Graphs. Specifically, we select Cora (Yang et al., 2016), Flickr (Zeng et al., 2020), and OGBN-Arxiv (Hu et al., 2020)... Language. We use two classification datasets: SST2 (Socher et al., 2013) is a collection of sentences extracted from movie reviews... MNLI (Williams et al., 2018) is a dataset of premise-hypothesis pairs... Vision. We use ImageNet100 (IN100), a random subsample of 100 classes of ImageNet1k (Russakovsky et al., 2015) and CIFAR-100 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | Yes | Graphs. We focus on graph datasets that provide multiclass labels for node classification, and for which dataset splits into training, validation and test sets are already available. Specifically, we select Cora (Yang et al., 2016), Flickr (Zeng et al., 2020), and OGBN-Arxiv (Hu et al., 2020). For the Cora graph, we extract representations from the complete test set of 1,000 instances, whereas for Flickr and OGBN-Arxiv, we subsampled the test set to 10,000 instances for computational reasons. Language. We used the validation and validation-matched subsets to extract representations for SST2 and MNLI, respectively. Vision. For ImageNet100, the 50 validation cases per class are used, resulting in N = 5000 samples, and for CIFAR-100 we use the full test dataset. |
| Hardware Specification | No | To conduct the experiments, a broad spectrum of hardware was used: For the model training, GPU nodes with up to 80GB VRAM were employed. Depending on domain, representations were either extracted on GPU nodes and saved to disk for later processing, or extracted on demand on CPU nodes. Lastly, the representational similarity measures were calculated between representations on CPU nodes with 6-256 CPU cores and working memory between 80 and 1024 GB. The paper mentions types of hardware (GPU nodes, CPU nodes) and general specifications (VRAM, cores, RAM) but lacks specific models (e.g., NVIDIA A100, Intel Xeon). |
| Software Dependencies | Yes | In our implementation, we used the respective model classes as provided in the PyTorch Geometric (Fey & Lenssen, 2019) package. We always used the Adam optimizer (Kingma & Ba, 2015) as implemented in PyTorch (Paszke et al., 2019)... Otherwise, we used default hyperparameters of the transformers library5. (footnote 5: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/trainer#transformers.TrainingArguments). |
| Experiment Setup | Yes | Table 4: Hyperparameters for all architectures on the respective datasets in the graph domain. (includes Dimension, Layers, Activation, Dropout Rate, Learning Rate, Weight Decay, Epochs)... Table 6: Vision domain: Training hyperparameters for all architectures on the ImageNet100 dataset. (includes Batch Size, Learning Rate, Weight Decay, Optimizer, Epochs)... We always used a linear learning rate schedule with 10% warmup to a maximum of 5e-5, evaluated every 1000 steps, and used a batch size of 64... all models shared more hyperparameters, namely label smoothing of 0.1 and a cosine annealing learning rate schedule. |
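The Pseudocode row notes that the paper defines its similarity measures mathematically rather than algorithmically. As an illustration of what such a measure computes, here is a minimal sketch of linear Centered Kernel Alignment (CKA), a standard representational similarity measure from this literature; whether this exactly matches the implementation in the ReSi repository is an assumption, and the function name and shapes are illustrative.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X has shape (n, d1) and Y has shape (n, d2), where each row is the
    representation of the same input in two different models or layers.
    Returns a similarity score in [0, 1].
    """
    # Center each representation dimension (column) over the n samples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style formulation via Frobenius norms of (cross-)covariances.
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

# A representation compared with itself yields similarity 1.
rng = np.random.default_rng(0)
R = rng.normal(size=(100, 32))
print(round(linear_cka(R, R), 6))  # 1.0
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is one reason such measures are compared against each other in benchmarks like ReSi.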