Visually Consistent Hierarchical Image Classification

Authors: Seulki Park, Youren Zhang, Stella Yu, Sara Beery, Jonathan Huang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "4 EXPERIMENTS: We first show that hierarchical classification remains challenging even for vision foundation models, which often yield inconsistent predictions. Our method outperforms existing approaches and flat baselines on benchmark datasets. We further validate our design through ablations and demonstrate that hierarchical supervision also benefits semantic segmentation."
Researcher Affiliation | Collaboration | Seulki Park (1), Youren Zhang (1), Stella X. Yu (1,2), Sara Beery (3), Jonathan Huang (4); 1 University of Michigan, 2 UC Berkeley, 3 MIT, 4 Scaled Foundations. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology in prose (e.g., Section 3.1, "H-CAST for Visual Consistency," and Section 3.2, "Tree-Path KL Divergence Loss for Semantic Consistency") and in mathematical formulas (Equations 1–3), but does not include structured pseudocode or an algorithm block.
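Since the paper gives its Tree-Path KL Divergence (TK) loss only in prose and equations not quoted here, the following is a rough illustrative sketch of a consistency term of that flavor, not the paper's exact formulation: the function name, the fine-to-coarse index mapping, and the probability-aggregation scheme are all assumptions.

```python
import numpy as np

def _softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tree_path_kl(fine_logits, coarse_logits, fine_to_coarse):
    """Hypothetical tree-path consistency term: KL between the fine-level
    prediction aggregated up to its coarse parents and the coarse-level
    prediction itself. `fine_to_coarse[f]` gives the coarse parent of
    fine class `f` (an assumed encoding of the label hierarchy)."""
    fine_prob = _softmax(fine_logits)                 # (B, num_fine)
    agg = np.zeros_like(coarse_logits, dtype=float)   # (B, num_coarse)
    for f, c in enumerate(fine_to_coarse):            # sum sibling classes
        agg[..., c] += fine_prob[..., f]
    coarse_prob = _softmax(coarse_logits)
    eps = 1e-12                                       # avoid log(0)
    kl = np.sum(agg * (np.log(agg + eps) - np.log(coarse_prob + eps)), axis=-1)
    return float(kl.mean())
```

The term is zero when the fine-level distribution, summed over each coarse parent's children, matches the coarse head's distribution, and grows as the two levels disagree.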
Open Source Code | Yes | "Our code is available at https://github.com/pseulki/hcast."
Open Datasets | Yes | "Datasets. We use widely used benchmarks in hierarchical recognition: BREEDS (Santurkar et al., 2021), CUB-200-2011 (Welinder et al., 2010), FGVC-Aircraft (Maji et al., 2013), and iNat21-Mini (Van Horn et al., 2021)."
Dataset Splits | Yes | "For BREEDS, we conduct training and validation using their source splits. BREEDS provides a wider class variety and larger sample size than CUB-200-2011 and FGVC-Aircraft, making it better suited for evaluating generalization performance. CUB-200-2011 comprises a 3-level hierarchy with order, family, and species; FGVC-Aircraft consists of a 3-level hierarchy of maker, family, and model (e.g., Boeing → Boeing 707 → 707-320). For experiments on a larger dataset, we used the 3-level iNat21-Mini. Details of iNat21-Mini are provided in Sec. E.4. Table 4 in the Appendix summarizes the datasets. (...) iNaturalist21-mini contains 10,000 classes, 500,000 training samples, and 100,000 test samples... iNaturalist-2018 includes two-level hierarchical annotations with 14 super-categories and 8,142 species, comprising 437,513 training images and 24,426 validation images."
Hardware Specification | No | The paper acknowledges "partial compute support from NAIRR Pilot (CIS240431, CIS250430)" but does not specify hardware details such as GPU models, CPU types, or memory amounts used for the experiments.
Software Dependencies | No | The paper mentions using the DeiT framework, the Adam optimizer, and several augmentation techniques (RandAug, label smoothing, mixup, cutmix), but does not give version numbers for these components or for other libraries such as PyTorch, TensorFlow, or Python.
Experiment Setup | Yes | "Table 5: Hyper-parameters for training H-CAST and ViT on FGVC-Aircraft, CUB-200-2011, BREEDS, and iNaturalist datasets. We follow mostly the same setup as CAST (Ke et al., 2024)." The table lists batch size (256), crop size (224), learning rate (1e-3 or 5e-4), weight decay (0.05), momentum (0.9), total epochs (100), warmup epochs (5), warmup learning rate (1e-4 or 1e-6), optimizer (Adam), learning-rate policy (cosine decay), augmentation (RandAug(9, 0.5)), label smoothing (0.1), mixup (0.8), cutmix (1.0), and α, the weight for the TK loss (0.5).
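For quick reference, the Table 5 values quoted above can be collected into a single configuration mapping. The key names below are illustrative only and are not the schema used by the released repository; where Table 5 reports two values (learning rate, warmup learning rate), the first is used and the alternative is noted in a comment.

```python
# Hyper-parameters as reported in Table 5 of the paper (H-CAST / ViT
# training on FGVC-Aircraft, CUB-200-2011, BREEDS, and iNaturalist).
# Key names are illustrative, not the repository's config format.
H_CAST_CONFIG = {
    "batch_size": 256,
    "crop_size": 224,
    "learning_rate": 1e-3,      # 5e-4 reported for some datasets
    "weight_decay": 0.05,
    "momentum": 0.9,
    "epochs": 100,
    "warmup_epochs": 5,
    "warmup_lr": 1e-4,          # 1e-6 reported for some datasets
    "optimizer": "adam",
    "lr_policy": "cosine_decay",
    "randaug": (9, 0.5),        # RandAug(magnitude, probability)
    "label_smoothing": 0.1,
    "mixup": 0.8,
    "cutmix": 1.0,
    "tk_loss_alpha": 0.5,       # weight for the TK (tree-path KL) loss
}
```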