Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
Authors: Huan Ma, Yan Zhu, Changqing Zhang, Peilin Zhao, Baoyuan Wu, Long-Kai Huang, Qinghua Hu, Bingzhe Wu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comparative analysis of the proposed method against various approaches, which validates its significant superiority. Extensive experiments confirm that our approach significantly enhances model stability against decision shortcuts compared to existing state-of-the-art methods. In the experiments, we evaluate different methods in different scenarios, including the real-world image datasets TinyImageNet (Le and Yang 2015) and CUB-200 (Wah et al. 2011), the benchmark simulated dataset Waterbirds (Koh et al. 2021), and the datasets created by S2E, Camel Deer and Spider Crab. We submit a subset of these datasets in the supplementary materials, limited by the maximum file size. |
| Researcher Affiliation | Collaboration | (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) AI Lab, Tencent, Shenzhen, China |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations (e.g., "We begin by presenting the basic methodology of visual-language prompt tuning. Subsequently, we will introduce our proposed approach..."), but it does not contain a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Codes: https://github.com/MaHuanAAA/SEraser |
| Open Datasets | Yes | In the experiments, we evaluate different methods in different scenarios, including the real-world image datasets TinyImageNet (Le and Yang 2015) and CUB-200 (Wah et al. 2011), the benchmark simulated dataset Waterbirds (Koh et al. 2021), and the datasets created by S2E, Camel Deer and Spider Crab. We submit a subset of these datasets in the supplementary materials, limited by the maximum file size. |
| Dataset Splits | No | The paper mentions evaluating performance on the "worst group" of datasets (e.g., Waterbirds, Camel Deer, Spider Crab) and refers to "the entire test set," implying the existence of test splits. However, it does not explicitly provide specific percentages or counts for training/validation/test splits, nor does it detail the methodology for creating these splits for reproduction (e.g., 80/10/10 split, specific random seeds, or citations to standard splits for all used datasets). |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using "CLIP with a pre-trained ViT-B-32 released by OpenAI (Radford et al. 2021)" and "SAM model (Wang et al. 2023b)". However, it does not provide specific version numbers for these or any other key software components, programming languages, or libraries used in their implementation. |
| Experiment Setup | No | The paper describes the general approach of optimizing a learnable prompt and minimizing Kullback-Leibler divergence as the optimization goal. However, it does not provide specific details such as learning rates, batch sizes, number of epochs, optimizer types, or other hyperparameter values that would allow for direct reproduction of the experimental setup. |
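To make concrete what "optimizing a learnable prompt with a Kullback-Leibler divergence objective" involves, the sketch below shows a toy test-time loop in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the random `img_feat`, `erased_feat`, and `txt_feat` tensors stand in for frozen CLIP image/text embeddings, the additive `prompt` is a simplification of token-space prompt tuning, and the pairing of the original view against a "spurious-erased" view as the KL target is assumed, since the paper does not report these setup details.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for frozen CLIP features (assumption: the real method uses
# CLIP ViT-B-32 encoders; random unit vectors suffice to show the loop).
num_classes, dim = 5, 32
img_feat = F.normalize(torch.randn(1, dim), dim=-1)      # original image view
erased_feat = F.normalize(torch.randn(1, dim), dim=-1)   # hypothetical "erased" view
txt_feat = F.normalize(torch.randn(num_classes, dim), dim=-1)

# Learnable additive prompt in embedding space; only this tensor is updated.
prompt = torch.zeros(num_classes, dim, requires_grad=True)
opt = torch.optim.SGD([prompt], lr=0.05)

def log_probs(image, text, scale=10.0):
    # CLIP-style scaled cosine-similarity logits over the class embeddings.
    logits = scale * image @ F.normalize(text, dim=-1).t()
    return F.log_softmax(logits, dim=-1)

# Reference distribution from the alternate view (assumed KL target).
target = log_probs(erased_feat, txt_feat).exp().detach()

initial_loss = F.kl_div(log_probs(img_feat, txt_feat + prompt), target,
                        reduction="batchmean").item()
for _ in range(100):
    opt.zero_grad()
    loss = F.kl_div(log_probs(img_feat, txt_feat + prompt), target,
                    reduction="batchmean")
    loss.backward()
    opt.step()
final_loss = loss.item()
```

With learning rate, step count, and logit scale chosen arbitrarily here, the KL loss decreases over the loop; these are exactly the hyperparameters the paper would need to report for the setup to be reproducible.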