Towards Debiased Generalized Category Discovery

Authors: Pengcheng Guo, Yonghong Song, Boyu Wang

IJCAI 2025

Reproducibility assessment (each entry lists: Variable, Result, LLM Response)
Research Type Experimental Experiments on various datasets show that DeGCD achieves state-of-the-art performance and maintains a good balance between new and old classes. In addition, the method can be seamlessly adapted to other GCD methods, not only achieving further performance gains but also effectively balancing the performance of the new classes with that of the old classes.
Researcher Affiliation Academia ¹School of Software Engineering, Xi'an Jiaotong University, China; ²Department of Computer Science and the Brain and Mind Institute, University of Western Ontario, Canada
Pseudocode No The paper describes methods using mathematical formulations and text, but does not include a distinct pseudocode or algorithm block.
Open Source Code No The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets Yes We validate the effectiveness of the proposed DeGCD on the generic benchmarks (including CIFAR10/100 [Krizhevsky et al., 2009] and ImageNet-100 [Deng et al., 2009]), the recently proposed Semantic Shift Benchmark (SSB, including CUB [Wah et al., 2011] and Stanford Cars [Krause et al., 2013]), and the harder Herbarium 19 [Tan et al., 2019].
Dataset Splits Yes For each dataset, following [Vaze et al., 2022], we sample a subset of all classes as the labeled ("Old") classes Y_l. Half of the images from these classes form the labeled set D_l, while the rest are treated as unlabeled data D_u. Table 1 shows the statistics of the datasets we evaluate on.
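The split protocol quoted above can be sketched as follows. This is a minimal illustration of the described procedure, not the authors' code; the function name and signature are hypothetical, and only the stated rule is implemented (sample a subset of classes as "Old", label half of their images, leave everything else unlabeled):

```python
import random
from collections import defaultdict

def gcd_split(samples, num_old_classes, seed=0):
    """Split (image, label) pairs into a labeled set D_l and an unlabeled set D_u.

    A subset of classes is sampled as the labeled ('Old') classes; half of
    the images of each Old class form D_l, and all remaining images (the
    other half, plus every image of a 'New' class) go to D_u.
    """
    rng = random.Random(seed)
    classes = sorted({c for _, c in samples})
    old_classes = set(rng.sample(classes, num_old_classes))

    # Group images by class so we can take exactly half per Old class.
    by_class = defaultdict(list)
    for x, c in samples:
        by_class[c].append(x)

    d_labeled, d_unlabeled = [], []
    for c, imgs in by_class.items():
        rng.shuffle(imgs)
        if c in old_classes:
            half = len(imgs) // 2
            d_labeled += [(x, c) for x in imgs[:half]]
            d_unlabeled += [(x, c) for x in imgs[half:]]
        else:
            d_unlabeled += [(x, c) for x in imgs]
    return d_labeled, d_unlabeled
```

Note that the class labels of D_u are retained here only for evaluation; during training they would not be visible to the model.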
Hardware Specification No The paper does not provide specific details regarding the hardware used for running experiments.
Software Dependencies No The paper mentions using ViT-B/16 [Dosovitskiy et al., 2021] pre-trained by DINO [Caron et al., 2021] as the backbone, but does not provide specific software dependency versions (e.g., PyTorch version, CUDA version, Python version, etc.).
Experiment Setup Yes We train for 200 epochs using a batch size of 256 and an initial learning rate of 0.1, which follows a cosine decay schedule for each dataset. The balancing factor ψ is set to 0.35, and the parameter ω to 0.2. Following [Wen et al., 2023], the temperature η is 0.1; following [Sohn et al., 2020], a threshold of 0.95 is used. Likewise, according to [Wen et al., 2023], τ_s is fixed at 0.1, and τ_t is initialized at 0.07, then gradually reduced to 0.04 using a cosine schedule over the first 30 epochs. The Gamma parameter β is set to 0.5. Following [He et al., 2020], ϵ and µ are set to 0.99.
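The two cosine schedules quoted above can be sketched as follows. The endpoint values (lr 0.1, τ_t from 0.07 to 0.04 over 30 epochs, 200 total epochs) come from the setup description; the standard half-cosine interpolation formula is an assumption, since the paper excerpt does not spell it out:

```python
import math

def cosine_lr(epoch, total_epochs=200, lr_init=0.1):
    # Learning rate decays from lr_init to 0 over training on a half cosine.
    return lr_init * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

def teacher_temperature(epoch, warmup_epochs=30, t_init=0.07, t_final=0.04):
    # tau_t starts at 0.07 and is reduced to 0.04 on a cosine schedule
    # over the first 30 epochs, then stays fixed at 0.04.
    if epoch >= warmup_epochs:
        return t_final
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / warmup_epochs))
    return t_final + (t_init - t_final) * cos
```

At epoch 0 these return 0.1 and 0.07 respectively, and τ_t reaches 0.04 exactly at epoch 30, matching the quoted setup.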