Exploring Human-in-the-Loop Test-Time Adaptation by Synergizing Active Learning and Model Selection

Authors: Yushu Li, Yongyi Su, Xulei Yang, Kui Jia, Xun Xu

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate on 5 TTA datasets that the proposed HILTTA approach is compatible with off-the-shelf TTA methods and such combinations substantially outperform the state-of-the-art HILTTA methods. Importantly, our proposed method can always prevent choosing the worst hyper-parameters on all off-the-shelf TTA methods. The source code is available at https://github.com/Yushu-Li/HILTTA. (...) 4 Experiments (...) We report the classification error rates for continual TTA in Tab. 2.
Researcher Affiliation Academia 1South China University of Technology, Guangzhou, China 2Institute for Infocomm Research (I2R), A*STAR, Singapore 3School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Pseudocode Yes Algorithm 1: Human-in-the-Loop TTA
Input: Source Model θ_0; Candidate Hyper-parameters Ω = {ω_m}; Testing Data Batches {B_t}
Output: Predictions Ŷ = {ŷ_i}
for t = 1 to T do
    # Make Predictions: ∀ x_i ∈ B_t, ŷ_i = h(x_i; θ_{t−1}), Ŷ = Ŷ ∪ {ŷ_i}
    # Oracle Annotation: select labeled subset B_t^l by Eq. 8
    for m = 1 to N_m do
        # Unsupervised Model Adaptation: θ_m^t = argmin_θ (1/N_b) Σ_{x_i ∈ B_t^u} L_tr^u(h(x_i; θ); ω_m)
        # Update Moving Average Validation Loss: update L_val^t by Eq. 6
    # Model Selection by Eq. 6: θ^t = argmin_m L_val^t(θ_m^t)
    # Supervised Model Adaptation: θ^t = argmin_θ (1/N_b) Σ_{(x_i, y_i) ∈ B_t^l} L_tr^l(h(x_i; θ), y_i; ω_m)
return Predictions Ŷ
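The loop in Algorithm 1 can be sketched in plain Python. This is a minimal toy illustration, not the authors' implementation: the model is a one-parameter threshold classifier, the oracle-annotation rule (the paper's Eq. 8) is replaced by random sampling, and the moving-average validation loss (the paper's Eq. 6) is stood in for by an exponential moving average of a toy supervised loss. All function names and update rules below are assumptions for illustration only.

```python
import random

def predict(theta, x):
    # Toy "model" h(x; theta): a threshold classifier on a scalar input.
    return int(x > theta)

def supervised_loss(theta, labeled):
    # Stand-in for the supervised loss L_tr^l on oracle-annotated samples.
    return sum((predict(theta, x) - y) ** 2 for x, y in labeled) / len(labeled)

def hiltta(theta0, candidates, batches, annot_frac=0.25, beta=0.5):
    """Toy sketch of the HILTTA loop: predict, annotate, adapt per
    hyper-parameter candidate, select a model, then refine supervised."""
    theta, preds = theta0, []
    # Moving-average validation loss per candidate (stand-in for Eq. 6).
    val_loss = {m: 0.0 for m in range(len(candidates))}
    for batch in batches:
        # 1) Make predictions with the current model.
        preds.extend(predict(theta, x) for x in batch)
        # 2) Oracle annotation: pick a labeled subset (random here; Eq. 8 in the paper).
        k = max(1, int(annot_frac * len(batch)))
        labeled_x = random.sample(batch, k)
        labeled = [(x, int(x > 0.5)) for x in labeled_x]   # toy oracle labels
        unlabeled = [x for x in batch if x not in labeled_x]
        # 3) Unsupervised adaptation: one candidate model per hyper-parameter.
        adapted = []
        for m, omega in enumerate(candidates):
            # One crude "adaptation step" toward the unlabeled batch mean, scaled by omega.
            theta_m = theta + omega * (sum(unlabeled) / len(unlabeled) - theta)
            adapted.append(theta_m)
            # 4) Update the moving-average validation loss for this candidate.
            val_loss[m] = beta * val_loss[m] + (1 - beta) * supervised_loss(theta_m, labeled)
        # 5) Model selection: keep the candidate with the lowest validation loss.
        theta = adapted[min(val_loss, key=val_loss.get)]
        # 6) Supervised adaptation on the labeled subset (toy refinement step).
        theta = theta - 0.1 * (theta - 0.5)
    return preds, theta
```

The key structural point the sketch preserves is that the labeled subset is used only for validation and the final supervised step, while each candidate's adaptation itself remains unsupervised.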
Open Source Code Yes The source code is available at https://github.com/Yushu-Li/HILTTA.
Open Datasets Yes Datasets: We select a total of five datasets for evaluation. CIFAR10-C and CIFAR100-C (Hendrycks & Dietterich, 2018) are small-scale corruption datasets with 15 different common corruptions, each containing 10,000 corrupt images with 10/100 categories. For our evaluation on large-scale datasets, we opt for ImageNet-C (Hendrycks & Dietterich, 2018), which also contains 15 different corruptions, each with 50,000 corrupt images in 1,000 categories. Additionally, ImageNet-D (Rusak et al., 2022) is a style-transfer dataset offering 6 domain shifts, each consisting of 10,000 images selected from 109 classes. Finally, we evaluate our method on the ModelNet40-C dataset (Sun et al., 2022), which includes 3,180 3D point clouds affected by 15 common types of corruption.
Dataset Splits Yes For CIFAR10-C and CIFAR100-C datasets, we use a batch size of 200 and an annotation percentage of 3%, with pre-trained WideResNet-28 (Zagoruyko & Komodakis, 2016) and ResNeXt-29 (Xie et al., 2017) models, respectively. For ImageNet-C and ImageNet-D, we use a batch size of 64 for adaptation and an annotation percentage of 3.2% of the total testing samples, employing the ResNet-50 (He et al., 2016) pre-trained model. (...) we conduct experiments on 3D point cloud classification using DGCNN (Wang et al., 2019) as the backbone, adapting it continuously across 15 domains within the ModelNet40-C (Sun et al., 2022) dataset. We maintained the same hyper-parameter candidate set as outlined in Tab. 8 and followed the experimental setup detailed in Tab. 2, with a batch size of 32 and an annotation percentage of 3.2%.
Hardware Specification Yes We conduct empirical studies to measure the wall-clock time of both the inference and adaptation steps in HILTTA, utilizing a single RTX 3090 GPU, an Intel Xeon Gold 5320 CPU, and 30 GB of RAM.
Software Dependencies No The paper mentions specific TTA methods and refers to their official implementations via GitHub links in footnotes (e.g., "1https://github.com/DequanWang/tent"), and also mentions the "Higher package (Grefenstette et al., 2019)" for meta optimization. However, it does not explicitly state specific version numbers for general software dependencies like Python, PyTorch, or other libraries.
Experiment Setup Yes Implementation Details: We evaluate TTA performance in the continual adaptation setting (Wang et al., 2022; Niu et al., 2022), where the target domain undergoes continuous changes. For CIFAR10-C and CIFAR100-C datasets, we use a batch size of 200 and an annotation percentage of 3%, with pre-trained WideResNet-28 (Zagoruyko & Komodakis, 2016) and ResNeXt-29 (Xie et al., 2017) models, respectively. For ImageNet-C and ImageNet-D, we use a batch size of 64 for adaptation and an annotation percentage of 3.2% of the total testing samples, employing the ResNet-50 (He et al., 2016) pre-trained model. We use the optimizer recommended in the original papers for each method's unsupervised training. We set the momentum β to 0.5. For supervised training, we use the Adam optimizer with a learning rate of 1e-5. We consider seven distinct values for each hyper-parameter category regarding model selection, detailed in Tab. 8.
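The per-dataset settings quoted above can be summarized as a small configuration table. The structure and names below are assumptions for illustration (the paper does not publish a config file in this form); only the numeric values are taken from the quoted setup.

```python
# Hypothetical summary of the reported experiment setup; dict/function names
# are illustrative, not from the authors' codebase.
CONFIGS = {
    "CIFAR10-C":    {"batch_size": 200, "annot_pct": 3.0, "backbone": "WideResNet-28"},
    "CIFAR100-C":   {"batch_size": 200, "annot_pct": 3.0, "backbone": "ResNeXt-29"},
    "ImageNet-C":   {"batch_size": 64,  "annot_pct": 3.2, "backbone": "ResNet-50"},
    "ImageNet-D":   {"batch_size": 64,  "annot_pct": 3.2, "backbone": "ResNet-50"},
    "ModelNet40-C": {"batch_size": 32,  "annot_pct": 3.2, "backbone": "DGCNN"},
}
SUPERVISED_OPT = {"optimizer": "Adam", "lr": 1e-5}  # supervised-training settings
MOMENTUM_BETA = 0.5  # moving-average momentum for the validation loss

def labeled_budget(dataset, n_test):
    """Number of oracle-annotated samples implied by the stated percentage."""
    return int(n_test * CONFIGS[dataset]["annot_pct"] / 100)
```

For example, with ImageNet-C's 50,000 test images per corruption, a 3.2% annotation budget corresponds to 1,600 labeled samples.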