Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

Authors: Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We train our recognition model using real images from ImageNet (Deng et al., 2009) and synthetic data generated with our pipeline. Each model is evaluated on ImageNet-val as the in-distribution evaluation. We further consider three ImageNet variations: ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), and ImageNet-Rendition (Hendrycks et al., 2021) to evaluate out-of-distribution generalization capability... We next present ablation studies. Unless stated otherwise, experiments in this section utilize 2.4M synthetic images generated with both CD and SD.
Researcher Affiliation | Collaboration | Zhuoran Yu1, Chenchen Zhu2, Sean Culatana2, Raghuraman Krishnamoorthi2, Fanyi Xiao2, Yong Jae Lee1; 1University of Wisconsin-Madison, 2Meta
Pseudocode | No | The paper describes methods and processes in narrative text and figures, such as Figure 2 depicting the synthetic data generation pipeline. However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like steps.
Open Source Code | No | The paper states it was "Reviewed on OpenReview: https://openreview.net/forum?id=YCt8lsIDwA", which is a platform for paper review, not a code repository. There is no explicit statement about releasing code or a link to a code repository for the methodology described in the paper.
Open Datasets | Yes | We train our recognition model using real images from ImageNet (Deng et al., 2009) and synthetic data generated with our pipeline. Each model is evaluated on ImageNet-val as the in-distribution evaluation. We further consider three ImageNet variations: ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), and ImageNet-Rendition (Hendrycks et al., 2021) to evaluate out-of-distribution generalization capability... We evaluate on four datasets: Oxford Flowers (Nilsback & Zisserman, 2008), Oxford Pets (Parkhi et al., 2012), Food-101 (Bossard et al., 2014), and Describable Textures (Cimpoi et al., 2014).
Dataset Splits | Yes | We train our recognition model using real images from ImageNet (Deng et al., 2009) and synthetic data generated with our pipeline. Each model is evaluated on ImageNet-val as the in-distribution evaluation. We further consider three ImageNet variations: ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), and ImageNet-Rendition (Hendrycks et al., 2021) to evaluate out-of-distribution generalization capability... For the low-data regime, we sample {100, 200, 500} real training images per class to form the real training data... The long-tailed version of ImageNet is constructed by truncating the original ImageNet training set. We use the exponential function (Cui et al., 2019) with different imbalance ratios (50, 100, and 200) from the original ImageNet training data... Evaluation is conducted on ImageNet-val, which is roughly balanced.
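The exponential truncation cited above (Cui et al., 2019) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the per-class maximum of 1300 (an approximation of ImageNet's per-class count), and the flooring behavior are all assumptions.

```python
# Hedged sketch: exponential long-tailed class sizes in the style of
# Cui et al. (2019), used to truncate a roughly balanced training set.
def longtail_class_sizes(num_classes=1000, n_max=1300, imbalance_ratio=100):
    """Return per-class image counts decaying exponentially from n_max
    (head class) down to n_max / imbalance_ratio (tail class)."""
    sizes = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)   # 0 for the head class, 1 for the tail
        sizes.append(int(n_max * imbalance_ratio ** (-frac)))
    return sizes

sizes = longtail_class_sizes(num_classes=1000, n_max=1300, imbalance_ratio=100)
# head class keeps n_max images; tail class keeps n_max / imbalance_ratio
```

With imbalance ratio 100, the head class keeps 1300 images while the tail class keeps 13; ratios 50 and 200 shift only the tail count.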
Hardware Specification | Yes | Therefore, we only use 2.4M synthetic data to train those larger models and significant improvement is already observed in Table 1... more than a week in our infrastructure with 4 8-A100 GPU nodes.
Software Dependencies | No | We use gpt-3.5-turbo (OpenAI, 2022) from OpenAI as our LLM for contextual diversification and style diversification. The paper mentions a specific LLM model used (gpt-3.5-turbo) but does not provide specific version numbers for other key software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) that would typically be required for reproducibility.
Experiment Setup | Yes | Implementation Details. Following prior work (Azizi et al., 2023), we evaluate our approach using different CNN (ResNet-{50, 101, 152}) and vision transformer (DeiT-{S, B, L} (Touvron et al., 2021)) architectures... We set the synthetic loss weight λ to 0.6 for all experiments, which is selected based on in-domain evaluation of ResNet-50 alone. Full details of our hyper-parameters can be found in the Appendix. (Appendix A includes Table 5: Training Details of ResNet Models and Table 6: Training Details of Vision Transformer Models with explicit values for Epochs, Batch size, Optimizer, Learning rate, Decay method, Weight decay, Warmup epochs, Label smoothing, Dropout rate, Data Augmentation, Drop Path, Mixup prob., Cutmix prob., Eval Crop Ratio.)
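The objective implied by the synthetic loss weight λ = 0.6 can be sketched as a weighted sum of cross-entropy terms over real and synthetic examples. This is a minimal pure-Python illustration under an assumed batching scheme, not the authors' implementation; the helper names and the (logits, label) batch format are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def combined_loss(real_batch, syn_batch, lam=0.6):
    """Mean loss on real examples plus lam times the mean loss on synthetic
    examples; each batch is a list of (logits, label) pairs."""
    l_real = sum(cross_entropy(lg, y) for lg, y in real_batch) / len(real_batch)
    l_syn = sum(cross_entropy(lg, y) for lg, y in syn_batch) / len(syn_batch)
    return l_real + lam * l_syn
```

With λ = 0.6, a synthetic example contributes 60% of the gradient weight of a real example, which matches the paper's choice of down-weighting synthetic data relative to ImageNet images.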