Multi-Label Test-Time Adaptation with Bound Entropy Minimization

Authors: Xiangyu Wu, Feng Yu, Yang Yang, Qing-Guo Chen, Jianfeng Lu

ICLR 2025

Reproducibility assessment. Each entry below lists the variable, the result, and the LLM response quoting the paper.
Research Type: Experimental
  "Across the MSCOCO, VOC, and NUSWIDE multi-label datasets, our ML-TTA framework equipped with BEM exhibits superior performance compared to the latest SOTA methods across various model architectures, prompt initializations, and varying label scenarios."
Researcher Affiliation: Collaboration
  Xiangyu Wu (1,2), Feng Yu (1), Qing-Guo Chen (2), Yang Yang (1), Jianfeng Lu (1). Affiliations: 1. Nanjing University of Science and Technology; 2. Alibaba International Digital Commerce Group.
Pseudocode: Yes
  Algorithm 1 (Label Binding Algorithm). Input: logits s_i before label binding and the size of the weak label set k_{x_i}. Output: modified logits s_i after label binding.
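The Label Binding step of Algorithm 1 can be sketched as follows. This is a minimal, hypothetical reconstruction: it assumes the binding assigns the highest logit among the top-k classes to each of those classes, so that subsequent entropy minimization raises the confidence of all k bound labels jointly. The function name and the exact binding rule are assumptions, not the authors' verbatim implementation.

```python
import numpy as np

def bind_labels(logits, k):
    """Label-binding sketch (hypothetical): replace the logits of the
    top-k classes with the maximum logit among them, so the k labels
    in the weak label set share one (highest) confidence.

    logits : 1-D array of per-class logits s_i for one test instance
    k      : size of the weak label set k_{x_i} for this instance
    """
    s = np.asarray(logits, dtype=float).copy()
    topk = np.argsort(s)[-k:]      # indices of the k highest logits
    s[topk] = s[topk].max()        # bind: all k labels get the top logit
    return s
```

For example, with logits `[0.1, 2.0, 1.5, -0.3]` and `k=2`, the two highest entries (2.0 and 1.5) are both set to 2.0, yielding `[0.1, 2.0, 2.0, -0.3]`.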
Open Source Code: Yes
  "The code is available at https://github.com/Jinx630/ML-TTA."
Open Datasets: Yes
  "We utilize the widely employed CLIP (Radford et al., 2021) model as source model and select the multi-label datasets VOC (Everingham et al., 2010), MSCOCO (Lin et al., 2014), and NUSWIDE (Chua et al., 2009) as target domains."
Dataset Splits: Yes
  "The VOC dataset includes 20 categories, covering both VOC2007 and VOC2012 versions, which contain 4,952 and 5,823 test images, respectively. The MSCOCO dataset extends the category range to 80, and for testing purposes, we employ the validation sets of COCO2014 with 40,504 images and COCO2017 with 5,000 images, as the test set labels are not accessible. The NUSWIDE dataset includes 81 categories with a total of 83,898 test images of lower resolution."
Hardware Specification: No
  No specific hardware details (GPU/CPU models, memory amounts) are provided. The paper mentions model architectures such as RN50 and ViT-B/16, but these are model backbones, not hardware specifications.
Software Dependencies: No
  The paper mentions models such as CLIP and Llama-2-7B and the AdamW optimizer, but does not provide specific version numbers for software libraries, programming languages, or operating systems used for implementation.
Experiment Setup: Yes
  "The learning rate for the view prompt is 1e-2, while for the caption prompt it is 1e-3. For all settings, multi-label test-time adaptation is performed on a single instance, i.e., the batch size is 1. The ratio for filtering confident views and captions is 0.1. The optimizer is AdamW (Loshchilov & Hutter, 2019) with a single update step, followed by immediate inference on the test instance."
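The reported setup can be summarized as a small configuration plus the confidence-filtering step it mentions. This is a sketch under stated assumptions: the hyperparameter values come from the quoted excerpt, but the `select_confident` function (entropy-based selection of the lowest-entropy 10% of augmented views/captions) is a common TTA heuristic and its exact criterion here is an assumption, as are all names.

```python
import numpy as np

# Hyperparameters quoted from the paper's experiment setup.
config = {
    "lr_view_prompt": 1e-2,
    "lr_caption_prompt": 1e-3,
    "batch_size": 1,          # adaptation on a single test instance
    "confidence_ratio": 0.1,  # fraction of views/captions kept
    "optimizer": "AdamW",
    "update_steps": 1,        # one update, then immediate inference
}

def select_confident(logits, ratio=0.1):
    """Keep the `ratio` fraction of augmented views/captions with the
    lowest predictive entropy (assumed filtering criterion).

    logits : (n_views, n_classes) array
    returns indices of the selected (most confident) views
    """
    # Softmax over classes, numerically stabilized.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    n_keep = max(1, int(len(logits) * ratio))
    return np.argsort(entropy)[:n_keep]
```

With 20 augmented views and `ratio=0.1`, the two lowest-entropy views survive; their averaged prediction would then drive the single AdamW update step.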