DynaPrompt: Dynamic Test-Time Prompt Tuning
Authors: Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Qi Wang, Cees G. M. Snoek
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct experiments on fourteen benchmarks, covering typical evaluation scenarios such as domain generalization and cross-dataset. The results show the effectiveness of the proposed method. |
| Researcher Affiliation | Collaboration | 1AIM Lab, University of Amsterdam 2Xiaohongshu Inc. 3Department of Automation, Tsinghua University |
| Pseudocode | Yes | We provide an algorithm of our method in Appendix A. |
| Open Source Code | Yes | Codes are available at https://github.com/zzzx1224/DynaPrompt. |
| Open Datasets | Yes | Fifteen datasets. Following previous methods (Shu et al., 2022; Samadh et al., 2023), we conduct experiments across two settings that suffer from distribution shifts to demonstrate the effectiveness of our method: domain generalization and cross-dataset shifts. For the domain generalization setting, we evaluate the method on ImageNet (Deng et al., 2009) and its four variant datasets: ImageNet-V2 (Recht et al., 2019), ImageNet-(S)ketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a). For the cross-dataset setting, we evaluate our method on 10 image classification datasets covering various tasks: Caltech101 (Fei-Fei et al., 2004), Oxford Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), FGVCAircraft (Maji et al., 2013), SUN397 (Xiao et al., 2010), DTD (Cimpoi et al., 2014), EuroSAT (Helber et al., 2019), and UCF101 (Soomro et al., 2012). |
| Dataset Splits | Yes | For the domain generalization setting, we evaluate the method on ImageNet (Deng et al., 2009) and its four variant datasets: ImageNet-V2 (Recht et al., 2019), ImageNet-(S)ketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b), and ImageNet-R (Hendrycks et al., 2021a). For the cross-dataset setting, we evaluate our method on 10 image classification datasets... Following TPT (Shu et al., 2022), we generate 63 augmentations by random resize crops for each individual test image to construct a batch of 64 images including the original image. |
| Hardware Specification | Yes | Our method runs on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions the "AdamW optimizer" and that it is "Based on the CLIP model with ViT-Base-16", but does not specify version numbers for any key software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python) used. |
| Experiment Setup | Yes | Based on the CLIP model with ViT-Base-16 (Dosovitskiy et al., 2020), we initialize our dynamic prompts with the manually crafted prompt "a photo of a" and optimize the prompts online in the text input embedding space. The prompt set optimized by one test sample is utilized for the next sample. Following TPT (Shu et al., 2022), we generate 63 augmentations by random resize crops for each individual test image to construct a batch of 64 images including the original image. During the dynamic tuning, we calculate the entropy and augmentation probability differences over these 63 augmented images as the dynamic prompt selection metrics. The thresholds are obtained in the same way based on the initial prompt. We set the maximum size of the prompt set, M, to 10. We append new prompts to the dynamic prompt set when no appropriate prompt is selected for the test sample. Once the number of prompts in the prompt set V exceeds M, we remove the prompt that has been inactive for the longest time. For optimization, we select the top 10% most confident samples in the batch and calculate the entropy of the averaged logits of the selected predictions, following Shu et al. (2022). We utilize a learning rate of 0.005 for the domain generalization setting and 0.003 for the cross-dataset setting with the AdamW optimizer. |
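The experiment-setup description involves two mechanisms: an online prompt set that appends a new prompt when none passes the selection metric and evicts the longest-inactive prompt once the set exceeds M = 10, and a TPT-style objective that keeps the most confident augmented views and minimizes the entropy of their averaged prediction. The sketch below illustrates both under stated assumptions: the class names (`DynamicPromptSet`, `tpt_entropy_loss`) and the `is_appropriate` callback are hypothetical stand-ins, and real prompts are CLIP text-embedding vectors rather than strings.

```python
# Illustrative sketch only; not the authors' implementation. Real prompts
# are learnable text-embedding vectors, and "appropriate" is decided by the
# paper's entropy / augmentation-probability-difference thresholds.
import math

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class DynamicPromptSet:
    """Holds at most max_size prompts; evicts the longest-inactive one."""
    def __init__(self, init_prompt, max_size=10):
        self.max_size = max_size
        self.prompts = [init_prompt]
        self.last_used = [0]   # step at which each prompt was last selected
        self.step = 0

    def select(self, is_appropriate):
        """Return the prompts passing the selection metric; if none does,
        append a fresh prompt (here: a copy of the initial one)."""
        self.step += 1
        chosen = [i for i, p in enumerate(self.prompts) if is_appropriate(p)]
        if not chosen:
            self.prompts.append(self.prompts[0])
            self.last_used.append(self.step)
            if len(self.prompts) > self.max_size:
                evict = self.last_used.index(min(self.last_used))
                del self.prompts[evict]
                del self.last_used[evict]
            chosen = [len(self.prompts) - 1]
        for i in chosen:
            self.last_used[i] = self.step
        return [self.prompts[i] for i in chosen]

def tpt_entropy_loss(probs_per_view, keep_frac=0.1):
    """TPT-style objective: keep the lowest-entropy (most confident)
    fraction of augmented views, average their class probabilities,
    and return the entropy of that average."""
    ranked = sorted(probs_per_view, key=entropy)
    k = max(1, int(len(ranked) * keep_frac))
    kept = ranked[:k]
    avg = [sum(col) / k for col in zip(*kept)]
    return entropy(avg)
```

Note the eviction rule: the newly appended prompt always carries the current step as its last-used time, so it is never the eviction candidate; the prompt removed is the one that has gone unselected for the most test samples, matching the "inactive for the longest time" description.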