Why Fine-grained Labels in Pretraining Benefit Generalization?
Authors: Guan Zhe Hong, Yin Cui, Ariel Fuxman, Stanley H. Chan, Enming Luo
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To convince readers who are less familiar with this particular training strategy, we conduct an experiment on ImageNet with details described in Appendix A.2 (we also include experiments on iNaturalist 2021 in Appendix A). |
| Researcher Affiliation | Collaboration | Guan Zhe Hong (Purdue University), Yin Cui (NVIDIA), Ariel Fuxman (Google Research), Stanley H. Chan (Purdue University), Enming Luo (Google Research) |
| Pseudocode | No | The paper describes mathematical derivations and the SGD update rule, but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | No | The paper states: "All of our experiments were performed using tools in the Scenic library Dehghani et al. (2022)." This refers to a third-party tool used by the authors, not their own implementation code being released. There is no explicit statement about releasing their source code, nor is a link provided. |
| Open Datasets | Yes | To convince readers who are less familiar with this particular training strategy, we conduct an experiment on ImageNet with details described in Appendix A.2 (we also include experiments on iNaturalist 2021 in Appendix A). Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. Grant Van Horn and Oisin Mac Aodha. iNat Challenge 2021 - FGVC8, 2021. URL https://kaggle.com/competitions/inaturalist-2021. |
| Dataset Splits | Yes | Figure 2 shows an experiment of pre-training on ImageNet21k and fine-tuning the pre-trained network on ImageNet1k. More specifically, we set X^src_train and X^tgt_train both equal to the training split of the input samples in iNaturalist 2021, and set X^tgt_test to the testing split of the input samples in iNaturalist 2021. The ImageNet21k dataset we experiment on contains a total of 12,743,321 training samples and 102,400 validation samples, with 21,843 leaf labels. |
| Hardware Specification | Yes | Each training instance (90 epochs) is run on 64 TPU v4 chips, taking approximately 1.5 to 2 days. |
| Software Dependencies | No | The paper mentions the "Scenic library" and the "ViT-B/16 model Dosovitskiy et al. (2021)" as tools and models used, but does not provide specific version numbers for these software components. For example, it does not state "Scenic library vX.Y.Z" or provide a version for Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Optimization: SGD with 0.9 momentum coefficient, 0.00005 weight decay, batch size 4096, and 90 epochs of total training. We perform 7 epochs of linear warmup at the beginning of training until the learning rate reaches 0.1 * 4096/256 = 1.6, and then apply the cosine annealing schedule. For fine-tuning, we keep everything in the pipeline the same except setting the batch size to 4096/4 = 1024 and the base learning rate to 1.6/4 = 0.4. |
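The schedule quoted above (linear warmup to a base rate set by the linear scaling rule, then cosine annealing) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `lr_at_epoch` and the per-epoch (rather than per-step) granularity are assumptions; the numeric values come from the quoted setup.

```python
import math

def lr_at_epoch(epoch, base_lr=1.6, warmup_epochs=7, total_epochs=90):
    """Learning rate under linear warmup followed by cosine annealing.

    Values (base LR 1.6, 7 warmup epochs, 90 total epochs) are taken from
    the experiment setup quoted above; per-epoch granularity is assumed.
    """
    if epoch < warmup_epochs:
        # Linear warmup from base_lr/warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr down toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Linear scaling rule for the base LR: 0.1 * batch_size / 256.
pretrain_lr = 0.1 * 4096 / 256   # = 1.6 for pretraining (batch size 4096)
finetune_lr = pretrain_lr / 4    # = 0.4 for fine-tuning (batch size 1024)
```

Note how the fine-tuning configuration keeps the LR-to-batch-size ratio fixed: dividing both the batch size and the base learning rate by 4 preserves the 0.1/256 scaling constant.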