Temporal Test-Time Adaptation with State-Space Models

Authors: Mona Schirmer, Dan Zhang, Eric Nalisnick

TMLR 2025

Reproducibility assessment (variable, result, LLM response):
Research Type: Experimental. "Through experiments on real-world temporal distribution shifts, we show that our method excels in handling small batch sizes and label shift. In Sec. 5, we conduct a comprehensive evaluation of STAD and prominent TTA baselines under authentic temporal shifts. Our results show that STAD excels in this setting (Tab. 2)..."
Researcher Affiliation: Collaboration. Mona Schirmer (UvA-Bosch Delta Lab, University of Amsterdam); Dan Zhang (Bosch Center for AI); Eric Nalisnick (Johns Hopkins University)
Pseudocode: Yes. Algorithm 1: STAD; Algorithm 2: EM for STAD-Gauss; Algorithm 3: EM for STAD-vMF
Open Source Code: No. The paper does not contain an explicit statement confirming the release of source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets: Yes. Yearbook (Ginosar et al., 2015); EVIS (Zhou et al., 2022a); FMoW-Time (Koh et al., 2021); CIFAR-10.1 (Recht et al., 2019); ImageNetV2 (Recht et al., 2019); CIFAR-10 (Krizhevsky et al., 2009); ImageNet (Deng et al., 2009); CIFAR-10-C (Hendrycks & Dietterich, 2019)
Dataset Splits: Yes. Yearbook: images from 1930 to 1969 are used for training and images from 1970-2013 for testing. EVIS: models are trained on images from 2009-2011 and evaluated on images from 2012-2020. FMoW-Time: 141,696 images are split into a training period (2002-2012) and a testing period (2013-2017).
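The temporal splits above all follow the same pattern: partition samples by calendar year at a fixed boundary. A minimal sketch, assuming samples are `(year, image)` pairs; the function name and sample representation are illustrative, not from the paper's code:

```python
def temporal_split(samples, train_end_year, test_start_year):
    """Split (year, sample) pairs into train/test sets by calendar year."""
    train = [s for s in samples if s[0] <= train_end_year]
    test = [s for s in samples if s[0] >= test_start_year]
    return train, test

# Yearbook-style split: train on 1930-1969, test on 1970-2013
samples = [(year, f"img_{year}") for year in range(1930, 2014)]
train, test = temporal_split(samples, train_end_year=1969, test_start_year=1970)
```

The same helper covers the EVIS (2011/2012) and FMoW-Time (2012/2013) boundaries by changing the two year arguments.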
Hardware Specification: Yes. All experiments are performed on an NVIDIA RTX 6000 Ada with 48GB memory.
Software Dependencies: No. The paper mentions software such as the Adam optimizer and the timm library (Wightman, 2019) but does not provide specific version numbers for these or any other software dependencies, which are necessary for full reproducibility.
Experiment Setup: Yes. "Batch sizes are the same for all baselines. To ensure optimal performance on newly studied datasets, we conduct an extensive hyperparameter search for each baseline (see App. C.4) and report the best setting. For Yearbook, we combine all samples of a year into one batch, resulting in a batch size of 2048. To create online class imbalance, we reduce the batch size to 64. We use a batch size of 100 for EVIS, CIFAR-10.1 and CIFAR-10-C, and 64 for FMoW-Time and ImageNetV2."
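The per-dataset batch sizes quoted above can be collected into a single lookup, which makes the reported setup easy to audit. A sketch only; the dictionary and helper are illustrative and not part of the paper's released configuration:

```python
# Batch sizes as reported in the experiment setup.
BATCH_SIZES = {
    "Yearbook": 2048,               # all samples of one year per batch
    "Yearbook-label-shift": 64,     # reduced to induce online class imbalance
    "EVIS": 100,
    "CIFAR-10.1": 100,
    "CIFAR-10-C": 100,
    "FMoW-Time": 64,
    "ImageNetV2": 64,
}

def batch_size_for(dataset):
    """Return the reported evaluation batch size for a dataset."""
    return BATCH_SIZES[dataset]
```

Since batch sizes are the same for all baselines, a single table like this suffices for every method in the comparison.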