Exploring Weak-to-Strong Generalization for CLIP-based Classification

Authors: Jinhao Li, Sarah Monazam Erfani, Lei Feng, James Bailey, Feng Liu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive experiments to evaluate the performance of the proposed method, CPL, using the DomainNet dataset (Peng et al., 2019), which includes six diverse visual domains... Extensive experiments demonstrate that our approach is effective under these settings, achieving a 3.67% improvement over strong baseline methods."
Researcher Affiliation | Academia | Jinhao Li (School of Computing and Information Systems, University of Melbourne, Australia); Sarah M. Erfani (School of Computing and Information Systems, University of Melbourne, Australia); Lei Feng (School of Computer Science and Engineering, Southeast University, China); James Bailey (School of Computing and Information Systems, University of Melbourne, Australia); Feng Liu (School of Computing and Information Systems, University of Melbourne, Australia).
Pseudocode | Yes | Algorithm 1: Weak-to-Strong Generalization for VLMs
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code or a link to a repository.
Open Datasets | Yes | "In our exploration of weak-to-strong scenarios, we turn to the challenging and relatively large DomainNet dataset (Peng et al., 2019). Comprising six diverse domains, each housing 345 categories of common objects, DomainNet offers a rich landscape for analysis... Ablation on Office-Home. As shown in Table 6, our method achieves the best performance across all four domains of the Office-Home dataset, outperforming existing baselines by a clear margin."
Dataset Splits | Yes | "(1) Dataset splitting: Referring to Table 2, each domain is divided into a training set D_train and a test set D_test. D_test is further partitioned into a hold split D_hold and a final test split, comprising 80% and 20% of D_test respectively."
Hardware Specification | Yes | "All our experiments are conducted using a single A100 GPU with 40GB of memory, supported by 8 CPU workers and 64GB of RAM."
Software Dependencies | No | "The code is mainly based on PyTorch and the Hugging Face library." This statement lacks specific version numbers for the mentioned libraries.
Experiment Setup | Yes | "During training, we used a test batch size of 2048 for evaluation. The weak model was trained for 3 epochs with a batch size of 512 and a learning rate of 1e-3, whereas the strong model underwent 10 epochs with the same batch size and a learning rate of 1e-2. The learning rate was adjusted dynamically, and a warm-up ratio of 0.1 was utilized."
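Algorithm 1 itself is not reproduced in this report. As a rough illustration of the weak-to-strong pipeline the paper describes (a weak model pseudo-labels a hold set, and a strong model is then trained on those pseudo-labels), here is a minimal sketch; the toy logistic-regression "models" and synthetic Gaussian features stand in for CLIP-based VLMs and are illustrative assumptions, not the paper's CPL method.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Toy stand-in for CLIP embeddings: two Gaussian blobs, one per class."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=(2 * y - 1)[:, None] * 1.0, scale=1.0, size=(n, 4))
    return x, y

def fit_linear(x, y, epochs=200, lr=0.1):
    """Minimal logistic-regression learner used for both weak and strong models."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # sigmoid probabilities
        g = p - y                                # gradient of log loss
        w -= lr * x.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def predict(params, x):
    w, b = params
    return (x @ w + b > 0).astype(int)

# 1) Train the weak supervisor on a small labeled set.
x_weak, y_weak = make_data(50)
weak = fit_linear(x_weak, y_weak, epochs=20)

# 2) The weak model pseudo-labels a larger unlabeled hold set.
x_hold, _ = make_data(1000)
pseudo = predict(weak, x_hold)

# 3) The strong model trains only on the weak pseudo-labels.
strong = fit_linear(x_hold, pseudo)

# 4) Evaluate both models on a held-out test set.
x_test, y_test = make_data(500)
acc_weak = (predict(weak, x_test) == y_test).mean()
acc_strong = (predict(strong, x_test) == y_test).mean()
print(f"weak acc: {acc_weak:.3f}, strong acc: {acc_strong:.3f}")
```

On this easy synthetic task both models perform similarly; the paper's contribution (CPL) concerns making the strong model exceed its weak supervisor, which this sketch does not attempt to demonstrate.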
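The dataset split and training schedule reported above can be sketched as follows. The 80/20 partition of the test set matches the report; the exact "dynamic" learning-rate schedule is not specified, so a linear warm-up followed by linear decay is assumed, and `total_steps` is illustrative.

```python
import random

def split_test_set(test_items, hold_ratio=0.8, seed=0):
    """Partition D_test into an 80% hold split and a 20% final test split,
    as described in the paper's dataset-splitting step."""
    items = list(test_items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * hold_ratio)
    return items[:cut], items[cut:]

def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.1):
    """Linear warm-up over the first 10% of steps, then linear decay.
    (Assumption: the report only states a dynamically adjusted learning
    rate with a warm-up ratio of 0.1, not the schedule's shape.)"""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# Strong model settings from the report: base lr 1e-2, warm-up ratio 0.1.
hold, final_test = split_test_set(range(1000))
schedule = [lr_at_step(s, total_steps=100, base_lr=1e-2) for s in range(100)]
print(len(hold), len(final_test), max(schedule))
```

The schedule peaks at the base learning rate at the end of warm-up and decays linearly toward zero afterward.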