reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

IT$^3$: Idempotent Test-Time Training

Authors: Nikita Durasov, Assaf Shocher, Doruk Oner, Gal Chechik, Alexei A Efros, Pascal Fua

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across diverse domains (including image classification, aerodynamics prediction, and aerial segmentation) and architectures (MLPs, CNNs, GNNs) show that IT3 consistently outperforms existing approaches while being simpler and more widely applicable. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures.
Researcher Affiliation	Collaboration	1CVLAB, EPFL 2NVIDIA 3Neura Vision Lab, Bilkent University 4UC Berkeley.
Pseudocode	No	The paper describes the methods and algorithms through text and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	poster / code / video / web
Open Datasets	Yes	We conducted similar experiments using the CIFAR10 (Krizhevsky et al., 2014) dataset, selecting CIFAR-C (Hendrycks & Dietterich, 2019) as the out-of-distribution (OOD) data. [...] we use The Boston Housing dataset describes housing prices in the suburbs of Boston, Massachusetts. [...] For this experiment, we used the Deep Layer Aggregation (DLA) (Yu et al., 2018) network [...] we use the UTKFace dataset (Zhang et al., 2017) [...] road segmentation in aerial imagery using the RoadTracer dataset (Bastani et al., 2018). We train a DRU-Net (Wang et al., 2019), on the RoadTracer dataset. [...] We perform OOD experiments using Massachusetts Road dataset (Mnih, 2013) [...] we generated a dataset of 2,000 wing profiles, as depicted in Fig. 10, by sampling the widely used NACA parameters (Jacobs & Sherman, 1937). [...] we experimented with 3D car models from a subset of the ShapeNet dataset (Chang et al., 2015) [...] ImageNet-C (Hendrycks & Dietterich, 2019) consists of ImageNet (Krizhevsky et al., 2012) test images corrupted using the same transformations as CIFAR-10/100C (Sec. 4.2).
Dataset Splits	No	The paper refers to using training and test sets and describes how OOD data was generated or selected, but it does not provide specific percentages, sample counts, or detailed methodologies for the train/validation/test splits of the primary datasets for reproducibility. For instance, for tabular data, it states: "We take a test set and gradually apply random feature zeroing with increasing probabilities of 5%, 10%, 15%, and 20% (4 mentioned levels of OOD)." and for age prediction: "We train our model on the UTKFace training set (limited to individuals aged 20-60)." It does not explicitly state how these original training/test sets were formed or their sizes.
Hardware Specification	No	The paper does not provide specific hardware details such as exact GPU or CPU models, processor types, or memory amounts. Table 5 in Appendix A mentions "Memory Consumption (OOD ImageNet), GPU Gb" but does not specify the type of GPU.
Software Dependencies	No	The paper mentions using "Adam optimizer", "Pytorch training protocol (Paszke et al., 2017)", "XFoil simulator (Drela, 1989)", and "OpenFOAM (Jasak et al., 2007)". However, it does not specify version numbers for PyTorch, Adam, or the other software packages, which is required for reproducible software dependencies.
Experiment Setup	Yes	To predict drag associated to a triangulated 3D car, we utilize similar model to airfoil experiments but with increased capacity. Instead of twenty five GMM layers, we use thirty five and also apply skip-connections with ELU activations. Final model is being trained for 100 epochs with Adam optimizer and $10^{-3}$ learning rate.
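
Illustrative sketch (not from the paper): the Dataset Splits row quotes a tabular OOD protocol in which a test set receives "random feature zeroing with increasing probabilities of 5%, 10%, 15%, and 20%". The snippet below shows one way such corrupted test sets could be generated. The function names, the seed handling, and the choice to zero each feature of each sample independently are assumptions made for illustration; the paper's actual procedure may differ.

```python
import random

# Zeroing probabilities quoted in the Dataset Splits row (4 OOD levels).
OOD_LEVELS = (0.05, 0.10, 0.15, 0.20)


def zero_features(rows, p, seed=0):
    """Return a copy of `rows` with each value set to 0.0 with probability p.

    Assumption: every feature of every sample is zeroed independently.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = random.Random(seed)  # local generator, so results are repeatable
    return [[0.0 if rng.random() < p else x for x in row] for row in rows]


def make_ood_test_sets(rows, levels=OOD_LEVELS, seed=0):
    """Build one corrupted copy of the test set per OOD level."""
    return {p: zero_features(rows, p, seed=seed) for p in levels}


if __name__ == "__main__":
    # Hypothetical stand-in for a tabular test set: 200 samples, 13 features.
    test_set = [[1.0] * 13 for _ in range(200)]
    for p, corrupted in make_ood_test_sets(test_set).items():
        zeros = sum(v == 0.0 for row in corrupted for v in row)
        print(f"p={p:.2f}: zeroed fraction = {zeros / (200 * 13):.3f}")
```

Because the same seed is reused at every level, a value zeroed at a lower probability is also zeroed at every higher one, so the four test sets form a nested sequence of increasingly corrupted data. Pass a different seed per level if independent corruptions are preferred.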