PTTA: Purifying Malicious Samples for Test-Time Model Adaptation
Authors: Jing Ma, Hanlin Li, Xiang Xiang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on four types of TTA tasks as well as classification, segmentation, and adversarial defense demonstrate the effectiveness of our method. |
| Researcher Affiliation | Academia | 1National Key Lab of Multispectral Info. Intelligent Processing Tech., School of Artificial Intelligence and Automation, Huazhong University of Science and Tech. (HUST), Wuhan, China. 2Peng Cheng National Lab, Shenzhen, China. 3School of Computer Science and Technology, HUST, China. Correspondence to: Xiang Xiang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Purification for Test-Time Adaptation, PTTA). Input: source model f_θ with parameters θ; test samples x_t = {x_ti}, i = 1..N_bs, from D_test arriving at time step t; a memory bank M; benign samples {x+_ti, y+_ti} selected by basic TTA methods; hyperparameter α; learning rate η. Output: predictions ŷ_t = {ŷ_ti}, i = 1..N_bs, for the test samples x_t. For each x_t in D_test: (1) compute predictions ŷ_t = f_θ(x_t) and the logit-saliency indicator z = ∇L_Ent(f_θ(x_t)) (Eq. 4); (2) incorporate the selected benign samples {x+_ti, y+_ti, z = ∇L_Ent(f_θ(x+_ti))} into M; (3) retrieve x_j from M using the saliency distance D_sa(x_i, x_j) (Eq. 3 and 5); (4) generate the purified image x′_ij and its pseudo-label y′_ij using Mixup (Eq. 6); (5) compute the total loss L_total = L_tta + α·L_pur(x′_ij, y′_ij) (Eq. 7); (6) update θ ← θ − η∇_θ L_total. |
| Open Source Code | Yes | Code is available at https://github.com/HAIV-Lab/PTTA. |
| Open Datasets | Yes | We employ ImageNet-C, CIFAR100-C (Hendrycks & Dietterich, 2018), ImageNet (Deng et al., 2009) and its variants: -A (Hendrycks et al., 2021b), -V2 (Recht et al., 2019), -R (Hendrycks et al., 2021a), -S (Wang et al., 2019) to construct these tasks. ImageNet-C contains 15 types of corruptions applied to the original ImageNet validation images, each having 5 severity levels. We exploit the most severe level (5th level) for experiments. The same applies to CIFAR100-C. ... Beyond image classification, we also consider the semantic segmentation task and employ the CarlaTTA dataset (Marsden et al., 2024a) for experiments. |
| Dataset Splits | Yes | In the episodic task, a single batch of test samples is used to optimize the model, and then the updated model makes predictions for the current batch. After that, the model's parameters are reset to the source. For single and continual tasks, the model is iteratively updated in static and dynamically changing environments, respectively. Furthermore, the lifelong task extends the dynamic environment indefinitely, set as 10 rounds and a total of 150 corruptions. Following previous TTA methods, we employ ImageNet-C, CIFAR100-C (Hendrycks & Dietterich, 2018), ImageNet (Deng et al., 2009) and its variants: -A (Hendrycks et al., 2021b), -V2 (Recht et al., 2019), -R (Hendrycks et al., 2021a), -S (Wang et al., 2019) to construct these tasks. ImageNet-C contains 15 types of corruptions applied to the original ImageNet validation images, each having 5 severity levels. We exploit the most severe level (5th level) for experiments. ... We run experiments on 3 random seeds and report the average accuracy and the standard deviation. |
| Hardware Specification | Yes | Hardware: CPU: Intel Xeon Silver 4210 @ 2.20GHz | GPU: NVIDIA GeForce RTX 3090 | RAM: 256GB |
| Software Dependencies | Yes | Software: PyTorch 1.9.0 | CUDA 11.1 |
| Experiment Setup | Yes | We set λ = 1/(K + 1), where K decides that the top-K samples with the largest saliency distance are retrieved from the scope. We uniformly set K = 1 and conduct an ablation study on the value of K in Sec. 4.4. Overall, the total loss function is defined as L_total = L_tta + α·L_pur, where α is a hyperparameter to balance the two loss functions, and we conduct an ablation study on α in Sec. 4.4. ... We set α = 3.0 for sample-selection-based TTA methods and α = 1.0 for selection-free TTA methods. We build a memory bank for OOD retrieval, setting it as a first-in-first-out queue and limiting its maximum length to 1,000. ... For ResNet50, we consistently use SGD with a learning rate of 0.00025, a momentum of 0.9, a batch size of 64, and no weight decay. For ViT-B/16, we consistently use SGD with a learning rate of 0.001, a momentum of 0.9, a batch size of 64, and no weight decay. |
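The per-batch loop in Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the saliency indicator, the saliency distance, and the two loss values are simplified stand-ins for the paper's Eq. 3-7 (which this report does not reproduce), and all function names here are hypothetical. Only the structural choices quoted above are kept faithfully: a first-in-first-out memory bank capped at 1,000 entries, top-K retrieval by largest saliency distance, and Mixup with λ = 1/(K + 1).

```python
from collections import deque

ALPHA = 1.0          # loss weight (paper: 3.0 for selection-based TTA, 1.0 otherwise)
K = 1                # top-K retrieval; the paper sets lambda = 1 / (K + 1)
LAM = 1.0 / (K + 1)  # Mixup coefficient, = 0.5 for K = 1
BANK_MAXLEN = 1000   # first-in-first-out memory bank, as in the paper

def saliency(x):
    """Hypothetical stand-in for the logit-saliency indicator z (Eq. 4)."""
    return sum(abs(v) for v in x) / len(x)

def saliency_distance(zi, zj):
    """Hypothetical stand-in for the saliency distance D_sa (Eq. 3 and 5)."""
    return abs(zi - zj)

def retrieve_topk(bank, z_query, k=K):
    """Retrieve the top-k benign samples with the LARGEST saliency distance."""
    ranked = sorted(bank,
                    key=lambda item: saliency_distance(z_query, item["z"]),
                    reverse=True)
    return ranked[:k]

def mixup(x_test, x_benign, lam=LAM):
    """Purify a (possibly malicious) test sample by mixing it with a benign
    one (Eq. 6); pseudo-labels would be mixed with the same coefficient."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_test, x_benign)]

def ptta_step(bank, x_test, benign_batch, l_tta, l_pur):
    """One time step: bank update, retrieval, purification, total loss (Eq. 7).
    l_tta and l_pur are scalar loss values supplied by the caller."""
    for x_plus, y_plus in benign_batch:        # selected by the basic TTA method
        bank.append({"x": x_plus, "y": y_plus, "z": saliency(x_plus)})
    z_t = saliency(x_test)
    purified = [mixup(x_test, nb["x"]) for nb in retrieve_topk(bank, z_t)]
    total_loss = l_tta + ALPHA * l_pur          # L_total = L_tta + alpha * L_pur
    return purified, total_loss

bank = deque(maxlen=BANK_MAXLEN)                # FIFO: old entries are evicted
```

In the real method the purified pairs (x′, y′) feed L_pur and θ is then updated by SGD with the learning rates quoted in the setup row; here the losses are just passed-through scalars to keep the control flow visible.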