Efficient Open Set Single Image Test Time Adaptation of Vision Language Models

Authors: Manogna Sreenivas, Soma Biswas

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments across diverse real-world benchmarks demonstrate that ROSITA sets a new state-of-the-art in open-set TTA, achieving both strong performance and computational efficiency for real-time deployment. The code is released at https://github.com/manogna-s/ROSITA.git."
Researcher Affiliation | Academia | "Manogna Sreenivas (EMAIL), Indian Institute of Science, Bengaluru; Soma Biswas (EMAIL), Indian Institute of Science, Bengaluru"
Pseudocode | No | The paper includes a diagram of the framework (Figure 1: ROSITA framework) and mathematical equations for the loss functions and gradients (Appendix A), but no explicitly structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our approach effectively adapts models to domain shifts for known classes while rejecting unfamiliar samples. Extensive experiments across diverse real-world benchmarks demonstrate that ROSITA sets a new state-of-the-art in open-set TTA, achieving both strong performance and computational efficiency for real-time deployment. The code is released at https://github.com/manogna-s/ROSITA.git."
Open Datasets | Yes | "Datasets. We experiment with a diverse set of datasets to choose desired class data D_d and undesired class data D_u. For D_d, we use CIFAR-10C (Hendrycks & Dietterich, 2019), CIFAR-100C (Hendrycks & Dietterich, 2019), ImageNet-C (Hendrycks & Dietterich, 2019), CCC (Press et al., 2023) from the corruption category and ImageNet-R (Hendrycks et al., 2021), VisDA (Peng et al., 2017) and the Clipart, Painting, Sketch domains from DomainNet (Peng et al., 2019) as style transfer datasets. We introduce samples from MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), CIFAR-10/100C (Hendrycks & Dietterich, 2019) and TinyImageNet (Le & Yang, 2015) datasets as D_u in the test stream. We describe the datasets in detail in Appendix B.3."
Dataset Splits | Yes | "OSTTA scenarios. We simulate several test scenarios inspired by the real world to evaluate the effectiveness of our method. (1) Single domain: We extend the standard TTA scenario, where the test samples come from an unseen domain D_d (say, the snow corruption of CIFAR-10C), by incorporating undesired samples D_u (say, CIFAR-100C). (2) Continuously changing domains: Here, D_t changes with time as (D_d^1 ∪ D_u) → (D_d^2 ∪ D_u) → … → (D_d^n ∪ D_u), where D_d^i is the i-th domain encountered. (3) Frequently changing domains: Here, we significantly reduce the number of samples per domain in continuous open-set TTA. The fewer the samples per domain, the more frequently the test domain changes, simulating very dynamic open-set test scenarios. (4) Varying sample ratio: The proportion of samples from C_d and C_u in the test stream is varied. ... for CIFAR10C/MNIST, we reduce the number of samples per corruption to 100, 250, 500, and 1000 in the continuously changing domain open-set TTA scenario."
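The stream construction these scenarios describe — desired-class domains arriving one after another, each mixed with undesired samples at a configurable ratio, with an optional per-domain cap to make domain changes more frequent — can be sketched in plain Python. This is an illustrative sketch only; `make_open_set_stream`, its argument names, and the `(sample, domain, kind)` tuple layout are hypothetical, not the authors' code.

```python
import random

def make_open_set_stream(desired, undesired, ratio=0.5,
                         samples_per_domain=None, seed=0):
    """Build a shuffled open-set test stream (hypothetical helper).

    desired: dict mapping domain name -> list of desired-class samples (D_d)
    undesired: list of undesired-class samples (D_u)
    ratio: fraction of each domain block drawn from the undesired set (< 1)
    samples_per_domain: optional cap per desired domain; small caps simulate
        the "frequently changing domains" scenario
    """
    rng = random.Random(seed)
    stream = []
    for domain, samples in desired.items():
        picked = samples[:samples_per_domain] if samples_per_domain else samples
        # number of undesired samples needed so they make up `ratio` of the block
        n_u = int(len(picked) * ratio / (1 - ratio))
        block = [(x, domain, "desired") for x in picked]
        block += [(rng.choice(undesired), domain, "undesired") for _ in range(n_u)]
        rng.shuffle(block)          # mix desired/undesired within the domain
        stream.extend(block)        # domains themselves arrive sequentially
    return stream
```

With `ratio=0.5` each domain block is half desired and half undesired samples; lowering `samples_per_domain` shortens each block and hence increases how often the test domain switches.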
Hardware Specification | Yes | "All experiments are done on a single NVIDIA A6000 GPU."
Software Dependencies | No | The paper mentions optimizers such as SGD and AdamW, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | "Implementation Details. We use CLIP and MaPLe backbones with the ViT-B/16 architecture. For ROSITA, we use the SGD optimizer with a learning rate of 0.001 to update the LayerNorm parameters of the vision encoder. We set the size of the score bank S to 512 and the number of neighbours K to 5. The size of the feature bank M_d is set as K · C_d and that of M_u to 64. Implementation details for all the baseline methods are presented in Appendix B.4."
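As a rough illustration of the bookkeeping these hyperparameters imply (a score bank of size 512 queried with K = 5 neighbours), the sketch below keeps a fixed-size FIFO bank of past scores and returns the mean of the K stored scores nearest a query. The class name `ScoreBank` and the `knn_mean` criterion are illustrative assumptions, not the paper's actual selection rule.

```python
from collections import deque

class ScoreBank:
    """Fixed-size FIFO bank of past confidence scores (hypothetical sketch;
    the paper's setup uses size 512 and K = 5)."""

    def __init__(self, size=512, k=5):
        self.bank = deque(maxlen=size)  # oldest scores evicted automatically
        self.k = k

    def add(self, score):
        self.bank.append(score)

    def knn_mean(self, score):
        """Mean of the K stored scores closest to the query score."""
        if not self.bank:
            return score
        neighbours = sorted(self.bank, key=lambda s: abs(s - score))[: self.k]
        return sum(neighbours) / len(neighbours)
```

Using `deque(maxlen=size)` keeps the bank at a constant memory footprint as the test stream grows, which matches the paper's emphasis on efficiency for real-time deployment.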