Multi-Label Test-Time Adaptation with Bound Entropy Minimization

Authors: Xiangyu Wu, Feng Yu, Yang Yang, Qing-Guo Chen, Jianfeng Lu

ICLR 2025

Reproducibility assessment. Each entry below lists the variable, the result, and the LLM response quoting the paper.
Research Type: Experimental
  "Across the MSCOCO, VOC, and NUSWIDE multi-label datasets, our ML-TTA framework equipped with BEM exhibits superior performance compared to the latest SOTA methods across various model architectures, prompt initializations, and varying label scenarios."
Researcher Affiliation: Collaboration
  Xiangyu Wu (1,2), Feng Yu (1), Qing-Guo Chen (2), Yang Yang (1), Jianfeng Lu (1). Affiliations: 1. Nanjing University of Science and Technology; 2. Alibaba International Digital Commerce Group.
Pseudocode: Yes
  Algorithm 1 (Label Binding Algorithm). Input: logits s_i before label binding and the size of the weak label set k_{x_i}. Output: modified logits s_i after label binding.
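The Label Binding step of Algorithm 1 can be sketched as follows. This is a minimal, hypothetical reconstruction: it assumes the binding assigns the highest logit among the top-k classes to each of those classes, so that subsequent entropy minimization raises the confidence of all k bound labels jointly. The function name and the exact binding rule are assumptions, not the authors' verbatim implementation.

```python
import numpy as np

def bind_labels(logits, k):
    """Label-binding sketch (hypothetical): replace the logits of the
    top-k classes with the maximum logit among them, so the k labels
    in the weak label set share one (highest) confidence.

    logits : 1-D array of per-class logits s_i for one test instance
    k      : size of the weak label set k_{x_i} for this instance
    """
    s = np.asarray(logits, dtype=float).copy()
    topk = np.argsort(s)[-k:]      # indices of the k highest logits
    s[topk] = s[topk].max()        # bind: all k labels get the top logit
    return s
```

For example, with logits `[0.1, 2.0, 1.5, -0.3]` and `k=2`, the two highest entries (2.0 and 1.5) are both set to 2.0, yielding `[0.1, 2.0, 2.0, -0.3]`.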
Open Source Code: Yes
  "The code is available at https://github.com/Jinx630/ML-TTA."
Open Datasets: Yes
  "We utilize the widely employed CLIP (Radford et al., 2021) model as source model and select the multi-label datasets VOC (Everingham et al., 2010), MSCOCO (Lin et al., 2014), and NUSWIDE (Chua et al., 2009) as target domains."
Dataset Splits: Yes
  "The VOC dataset includes 20 categories, covering both VOC2007 and VOC2012 versions, which contain 4,952 and 5,823 test images, respectively. The MSCOCO dataset extends the category range to 80, and for testing purposes, we employ the validation sets of COCO2014 with 40,504 images and COCO2017 with 5,000 images, as the test set labels are not accessible. The NUSWIDE dataset includes 81 categories with a total of 83,898 test images of lower resolution."
Hardware Specification: No
  No specific hardware details (GPU/CPU models, memory amounts) are provided. The paper mentions model architectures such as RN50 and ViT-B/16, but these are model backbones, not hardware specifications.
Software Dependencies: No
  The paper mentions models such as CLIP and Llama-2-7B and the AdamW optimizer, but does not provide specific version numbers for software libraries, programming languages, or operating systems used for implementation.
Experiment Setup: Yes
  "The learning rate for the view prompt is 1e-2, while for the caption prompt it is 1e-3. For all settings, multi-label test-time adaptation is performed on a single instance, i.e., the batch size is 1. The ratio for filtering confident views and captions is 0.1. The optimizer is AdamW (Loshchilov & Hutter, 2019) with a single update step, followed by immediate inference on the test instance."
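The reported setup can be summarized as a small configuration plus the confidence-filtering step it mentions. This is a sketch under stated assumptions: the hyperparameter values come from the quoted excerpt, but the `select_confident` function (entropy-based selection of the lowest-entropy 10% of augmented views/captions) is a common TTA heuristic and its exact criterion here is an assumption, as are all names.

```python
import numpy as np

# Hyperparameters quoted from the paper's experiment setup.
config = {
    "lr_view_prompt": 1e-2,
    "lr_caption_prompt": 1e-3,
    "batch_size": 1,          # adaptation on a single test instance
    "confidence_ratio": 0.1,  # fraction of views/captions kept
    "optimizer": "AdamW",
    "update_steps": 1,        # one update, then immediate inference
}

def select_confident(logits, ratio=0.1):
    """Keep the `ratio` fraction of augmented views/captions with the
    lowest predictive entropy (assumed filtering criterion).

    logits : (n_views, n_classes) array
    returns indices of the selected (most confident) views
    """
    # Softmax over classes, numerically stabilized.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    n_keep = max(1, int(len(logits) * ratio))
    return np.argsort(entropy)[:n_keep]
```

With 20 augmented views and `ratio=0.1`, the two lowest-entropy views survive; their averaged prediction would then drive the single AdamW update step.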