Deep Linear Probe Generators for Weight Space Learning

Authors: Jonathan Kahana, Eliahu Horwitz, Imri Shuval, Yedid Hoshen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation follows the standard protocol for weight space learning. We evaluate on two tasks: (i) CNN generalization error prediction and (ii) detecting the training classes of images based on INR networks trained on them. We include experiments on small-scale established benchmarks as well as a new larger-scale Model Zoo which we present, using ResNet18 (He et al., 2016) models. ... Table 3: Results for Small-Scale Benchmarks. Comparison of ProbeGen to graph-based, mechanistic approaches and latent-optimized probes. We average the results over 5 different seeds.
Researcher Affiliation | Academia | Jonathan Kahana, Eliahu Horwitz, Imri Shuval, Yedid Hoshen, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Pseudocode | No | The paper describes the methodology in prose and through figures, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://vision.huji.ac.il/probegen/ ... 11 REPRODUCIBILITY. In this work, we presented a new and lightweight framework for weight space learning. Our method is simple to implement and can be easily reproduced. To encourage future work in this direction, we provide a short implementation of our method in the supplementary materials.
Open Datasets | Yes | We evaluate on 4 established datasets. For training data prediction we choose the MNIST and FMNIST implicit neural representation (INR) benchmarks (Navon et al., 2023a). ... For generalization error prediction, we used the CIFAR10-GS (Unterthiner et al., 2020) and CIFAR10 Wild Park (Kofinas et al., 2024) tasks. ... Tiny ImageNet (Le & Yang, 2015; Deng et al., 2009). ... we use the Neural-Field-Arena (Papa et al., 2024) to evaluate ProbeGen's ability to classify INRs trained on point clouds from the ShapeNet (Chang et al., 2015) dataset.
Dataset Splits | Yes | We evaluate on 4 established datasets. For training data prediction we choose the MNIST and FMNIST implicit neural representation (INR) benchmarks (Navon et al., 2023a). ... For generalization error prediction, we used the CIFAR10-GS (Unterthiner et al., 2020) and CIFAR10 Wild Park (Kofinas et al., 2024) tasks. ... Each ResNet model was trained on a randomly selected subset of Tiny ImageNet (Le & Yang, 2015; Deng et al., 2009). We sampled the subset out of a closed list of 10 subsets that we created in advance.
Hardware Specification | No | The paper mentions computational costs in terms of FLOPs and states that inferring about a model would require "computational resources equivalent to training such a model." However, it does not specify any particular hardware such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper states that an implementation of the method is provided in the supplementary materials, but it does not specify any software libraries or frameworks with their version numbers that would be required to reproduce the experiments.
Experiment Setup | Yes | Hyper-parameters: We use a learning rate of 3×10⁻⁴ and a batch size of 32 in all our experiments. Our MLP classifier C uses 6 layers with a hidden size of 256. The latent vectors of each probe are of size 32. We trained all probing algorithms on the INR and CIFAR10 Wild Park experiments for 30 epochs, all experiments on the CIFAR10-GS dataset for 150 epochs, and all experiments on our ResNet18 Model Zoo dataset for 100 epochs.
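The reported hyper-parameters can be collected into a single configuration for a reproduction attempt. Below is a minimal sketch; the dictionary keys, the `PROBEGEN_CONFIG` name, and the `epochs_for` helper are our own naming for illustration, not identifiers from the authors' code.

```python
# Hypothetical configuration assembled from the hyper-parameters quoted above.
# Names are illustrative; only the numeric values come from the paper.
PROBEGEN_CONFIG = {
    "learning_rate": 3e-4,          # used in all experiments
    "batch_size": 32,               # used in all experiments
    "classifier_layers": 6,         # MLP classifier C
    "classifier_hidden_size": 256,
    "probe_latent_size": 32,        # latent vector size per probe
    "epochs": {
        "inr_benchmarks": 30,       # MNIST/FMNIST INRs and CIFAR10 Wild Park
        "cifar10_gs": 150,
        "resnet18_model_zoo": 100,
    },
}

def epochs_for(benchmark: str) -> int:
    """Look up the training-epoch budget reported for a given benchmark."""
    return PROBEGEN_CONFIG["epochs"][benchmark]
```

Keeping the per-benchmark epoch budgets in one table makes it easy to verify a reproduction run against the paper's stated setup.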