Deep Active Learning in the Open World
Authors: Tian Xie, Jifan Zhang, Haoyue Bai, Robert D Nowak
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on three long-tailed image classification benchmarks demonstrate that ALOE outperforms traditional active learning baselines, effectively expanding known categories while balancing annotation cost. Our findings reveal a crucial tradeoff between enhancing known-class performance and discovering new classes, setting the stage for future advancements in open-world machine learning. |
| Researcher Affiliation | Academia | Tian Xie EMAIL Jifan Zhang EMAIL Haoyue Bai EMAIL Robert Nowak EMAIL University of Wisconsin-Madison |
| Pseudocode | Yes | Algorithm 1 ALOE: Active Learning in Open-world Environments |
| Open Source Code | No | Code of the work will be uploaded at https://github.com/EfficientTraining/LabelBench. |
| Open Datasets | Yes | Specifically, we utilize three image classification benchmark datasets, CIFAR100-LT (Alex, 2009), ImageNet-LT (Deng et al., 2009) and Places365-LT (Zhou et al., 2017). |
| Dataset Splits | Yes | CIFAR100 contains 60,000 images across 100 classes, with 500 training images and 100 testing images per class. To create a long-tailed version of CIFAR100, we use an exponential distribution where the number of examples per class is given by N_i = N_0 · α^(i/n), where n is the total number of classes, N_0 is the number of examples in the most frequent class, and α is the imbalance factor. In our experiments, we set α = 0.01, creating a highly imbalanced version of the dataset. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA TITAN RTX for CIFAR100-LT and Places365-LT, and NVIDIA A100 for ImageNet-LT. |
| Software Dependencies | Yes | Our method is implemented with PyTorch 2.2.0. |
| Experiment Setup | Yes | For evaluation, we follow the latest LabelBench framework (Zhang et al., 2024a), while introducing the new open-world setting with a dynamic number of classes at each iteration. Specifically, we fine-tune the pretrained CLIP ViT-B/32 image encoder (Radford et al., 2021) with a linear classification head attached. For every iteration of the active learning algorithm, the model is reinitialized to the pretraining checkpoint and fine-tuned end-to-end on all labeled examples thus far. The clustering method, k-means, is then applied to these embedded features. The number of clusters, 2 · max(B, \|K_t\|), is set so that we obtain a surplus of clusters to effectively filter out the in-distribution examples, where B is the batch size and \|K_t\| is the number of annotated classes in step t. Empirically throughout our experiments, we find the multiplier value 2 to be a good and robust choice. To identify OOD examples, we establish a threshold at the 95%-TPR cutoff based on the in-distribution labeled examples. This threshold is commonly used in the OOD detection literature and ensures at least 95% of the labeled examples are classified as ID. The clusters are then ranked by their OOD cluster ratio. The top B clusters are selected for further processing, and from each of these selected clusters, the example with the highest OOD score is chosen to form the final batch X(t) of examples for annotation. |
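The exponential class-count schedule quoted in the Dataset Splits row (N_i = N_0 · α^(i/n)) can be sketched as follows. This is our own minimal reading of the formula, not the authors' code; the helper name `long_tail_counts` is ours, and the values (n = 100 classes, N_0 = 500, α = 0.01) mirror the CIFAR100-LT setup described above.

```python
def long_tail_counts(n_classes: int, n0: int, alpha: float) -> list[int]:
    """Per-class example counts under the exponential long-tail schedule
    N_i = N_0 * alpha^(i / n), truncated to integers."""
    return [int(n0 * alpha ** (i / n_classes)) for i in range(n_classes)]

counts = long_tail_counts(n_classes=100, n0=500, alpha=0.01)
print(counts[0], counts[-1])  # head class: 500 examples, tail class: 5
```

Note that the tail-to-head ratio, 5/500 = 0.01, matches the imbalance factor α, which is exactly the role α plays in the formula.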
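The batch-selection step in the Experiment Setup row (95%-TPR threshold on labeled ID examples, clusters ranked by OOD ratio, top-B clusters, highest-OOD-score example per cluster) can be sketched as below. This is a hedged reconstruction from the quoted description only, not the authors' implementation: `select_batch` and its arguments are our names, and we assume the OOD scores, k-means cluster labels (with 2·max(B, |K_t|) clusters), and labeled-ID scores are precomputed.

```python
import numpy as np

def select_batch(ood_scores, cluster_labels, id_scores, batch_size):
    """Pick one example from each of the top-`batch_size` clusters,
    where clusters are ranked by their fraction of OOD examples."""
    # 95%-TPR cutoff: at least 95% of labeled ID examples score below it.
    threshold = np.quantile(id_scores, 0.95)
    is_ood = ood_scores > threshold
    # Rank clusters by OOD ratio, highest first.
    clusters = np.unique(cluster_labels)
    ratios = sorted(
        ((is_ood[cluster_labels == c].mean(), c) for c in clusters),
        reverse=True,
    )
    # From each selected cluster, take the example with the highest OOD score.
    picked = []
    for _, c in ratios[:batch_size]:
        members = np.flatnonzero(cluster_labels == c)
        picked.append(int(members[np.argmax(ood_scores[members])]))
    return picked
```

A small usage example: with cluster labels `[0, 0, 0, 1, 1, 2]`, OOD scores `[0.1, 0.2, 0.9, 0.95, 0.9, 0.1]`, labeled-ID scores `[0.0, 0.1, 0.2, 0.3]`, and B = 2, cluster 1 (all OOD) and cluster 0 (one OOD member) are selected, yielding indices 3 and 2.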