Generative Data Mining with Longtail-Guided Diffusion

Authors: David S Hayden, Mao Ye, Timur Garipov, Gregory P. Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric M Wolff, Siddhartha Srinivasa

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model. We evaluate on seven natural image datasets spanning fine-grained, coarse-grained, and mixed coarse/fine classification tasks. Datasets include ImageNet, ImageNet-V2, ImageNet-A, Stanford Cars (Krause et al., 2013), Oxford Flowers (Nilsback & Zisserman, 2008), Oxford Pets (Parkhi et al., 2012), and Caltech101 (Fei-Fei et al., 2004). Results are averaged over three runs.
Researcher Affiliation Collaboration 1 Cruise, LLC, San Francisco, CA; 2 OpenAI, San Francisco, CA; 3 Columbia University, New York, NY; 4 Upwork, Palo Alto, CA; 5 Meta, Menlo Park, CA.
Pseudocode Yes Algorithm 1: Longtail Guidance
Input: latent diffusion model ϵθ(zt, t), predictor fϕ(x), latent decoder D, noise schedule σ1:T, guidance weight w
Initialize: zT ~ N(0, I)
for t = T − 1, ..., 0 do
    Estimate terminal latent state ẑ0,t = P(zt) as in Eqn. 8
    Decode terminal data state: x̂0,t = D(ẑ0,t)
    Compute model longtail signal f^lt_ϕ(x̂0,t) as in Eqn. 4
    Bias the denoising estimate as in Eqn. 11
    Compute zt−1 as in Eqn. 7
end for
return x = D(z0)
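The loop above can be sketched in code. This is a minimal toy sketch, not the paper's implementation: the decoder D is the identity, the denoiser eps_theta is a linear stand-in, the DDIM update is a standard deterministic step standing in for Eqns. 7-8, and the longtail signal of Eqn. 4 is replaced by a quadratic f_lt(x) = -||x - target||^2 / 2 whose gradient is available in closed form; `target`, `sample`, and all parameter values are our own illustrative names.

```python
import numpy as np

def sample(w, T=50, dim=8, seed=0):
    """One DDIM-style reverse pass with classifier-style longtail biasing."""
    rng = np.random.default_rng(seed)
    # Cumulative signal level abar_t: abar_0 ~ 1 (clean), abar_T ~ 0 (noise).
    abar = np.cos(0.5 * np.pi * np.linspace(0.0, 0.95, T + 1)) ** 2

    def eps_theta(z_t, t):
        # Toy denoiser: treats the noise component as proportional to z_t.
        return np.sqrt(1.0 - abar[t]) * z_t

    decode = lambda z: z          # identity decoder stands in for D
    target = np.ones(dim)         # hypothetical high-longtail region

    z = rng.standard_normal(dim)  # z_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps = eps_theta(z, t)
        # Eqn. 8 analogue: terminal latent estimate z0_hat = P(z_t).
        z0 = (z - np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(abar[t])
        x0 = decode(z0)           # decode terminal data state
        # Eqn. 4 analogue: gradient of the toy longtail signal,
        # pulling x0 toward `target`.
        grad = target - x0
        # Eqn. 11 analogue: bias the denoising estimate by the gradient.
        eps = eps - w * np.sqrt(1.0 - abar[t]) * grad
        # Eqn. 7 analogue: deterministic DDIM step to z_{t-1}.
        z0 = (z - np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(abar[t])
        z = np.sqrt(abar[t - 1]) * z0 + np.sqrt(1.0 - abar[t - 1]) * eps
    return decode(z)

x_guided = sample(w=0.5)  # w = 0 recovers unguided DDIM sampling
```

The key structural point the sketch preserves is that guidance is computed on the decoded terminal estimate x̂0,t rather than on the noisy latent zt, so any off-the-shelf predictor can supply the longtail signal.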
Open Source Code No The paper does not contain any explicit statements about code release (e.g., "We release our code...") nor does it provide any direct links to code repositories. Mentions of third-party tools like Stable Diffusion, LLaVA-1.6 7B, and GPT-4o do not count as providing code for the methodology described in this paper.
Open Datasets Yes We evaluate on seven natural image datasets spanning fine-grained, coarse-grained, and mixed coarse/fine classification tasks. Datasets include ImageNet, ImageNet-V2, ImageNet-A, Stanford Cars (Krause et al., 2013), Oxford Flowers (Nilsback & Zisserman, 2008), Oxford Pets (Parkhi et al., 2012), and Caltech101 (Fei-Fei et al., 2004). Table 7. Overview of datasets with number of classes, training samples, validation samples, and synthetic samples. *ImageNet-A and ImageNet-V2 are not trained on; they are only used for evaluation.
Dataset Splits Yes We use the same predictive model architecture (ResNet50 trained from scratch), generate the same quantity of synthetic data (30× dataset size for Pets, 20× dataset size for Caltech101, Cars, and Flowers), the same diffusion model (Stable Diffusion v1.4), the same diffusion sampler (DDIM), and the same number of sampling iterates (50). We then fine-tune for 100 epochs, generating synthetic data with Longtail Guidance according to the schedule in Table 5 (e.g. generate synthetic data in epoch 0, fine-tune until epoch 5, generate more synthetic data, fine-tune until epoch 10, ...). In Table 3, we train SOTA ViT-based models (LiVT) from scratch on ImageNet-LT according to the longtail-compensation approach of (Xu et al., 2023). In summary, training includes MAE pretraining followed by 100 epochs of BCE loss with a logit adjustment to account for class imbalance.
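The interleaved generate-then-fine-tune loop described in that excerpt can be sketched as follows. The fixed 5-epoch interval mirrors the example in the text only; the paper's actual schedule is given in its Table 5, and the function name and structure here are illustrative.

```python
def interleaved_finetune(total_epochs=100, interval=5):
    """Generate synthetic data at scheduled epochs, fine-tune in between.

    Returns the number of generation rounds, as a stand-in for the real
    loop, which would call a Longtail-Guidance sampler and a trainer.
    """
    generation_rounds = 0
    for epoch in range(total_epochs):
        if epoch % interval == 0:
            # e.g. sample new synthetic images with Longtail Guidance here
            generation_rounds += 1
        # ... one epoch of fine-tuning on real + synthetic data ...
    return generation_rounds
```

With 100 epochs and a 5-epoch interval this yields 20 generation rounds (epochs 0, 5, ..., 95).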
Hardware Specification Yes All experiments are performed on 8× H100 GPUs (Section A.1); generation uses 50 DDIM sampling steps on 8× H100 GPUs (Section A.5).
Software Dependencies Yes We use the same Stable Diffusion 1.4 baseline and sampling details as in Section 4.1 (LiVT SD). We construct refined prompts for each class by prompting LLaVA-1.6 7B (Liu et al., 2024) to generate a description for two sets of synthetic images... Following this, we create P = 40 refined prompts per class by prompting GPT-4o (OpenAI, 2023) with: <VLM description of the image> The following keywords describe the key features of the description above: <Keyword 1>, <Keyword 2> ... Use a complete sentence to summarize the key features. The sentence should start with: A photo of <Class> that....
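Assembling that refinement prompt from its pieces is simple string templating. A minimal sketch, filling the template quoted above; the function and argument names are ours, and in the paper's pipeline the description comes from LLaVA-1.6 7B and the result is sent to GPT-4o:

```python
def build_refinement_prompt(class_name, vlm_description, keywords):
    """Fill the refined-prompt template with a VLM description and keywords."""
    return (
        f"{vlm_description} "
        "The following keywords describe the key features of the description "
        f"above: {', '.join(keywords)}. "
        "Use a complete sentence to summarize the key features. "
        f"The sentence should start with: A photo of {class_name} that..."
    )

prompt = build_refinement_prompt(
    "Stanford Cars",                      # illustrative class name
    "A red coupe parked on a wet street.",  # illustrative VLM description
    ["red", "coupe", "wet street"],
)
```

This would be run P = 40 times per class with varying descriptions/keywords to produce the refined prompt set.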
Experiment Setup Yes We then fine-tune for 100 epochs, generating synthetic data with Longtail Guidance according to the schedule in Table 5 (e.g. generate synthetic data in epoch 0, fine-tune until epoch 5, generate more synthetic data, fine-tune until epoch 10, ...). We also experimentally found that using total uncertainty (epistemic + aleatoric, the first term in the RHS of Equation 4) as the LTG guidance signal from the Epistemic Head was overall slightly more performant for downstream predictive model generalization improvements (as compared to epistemic or aleatoric alone)... Fine-tuning is with the Adam optimizer, cosine annealing learning rate schedule, and a 1e-3 learning rate. As in GIF, we train with random rotations (±15°), 224×224 crops, and horizontal flips.
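The cosine-annealed learning rate mentioned in the fine-tuning setup follows a standard closed form. A minimal sketch with the base rate of 1e-3 from the text; the floor `lr_min=0.0` and the per-epoch granularity are our assumptions, not stated in the paper:

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate: lr_max at epoch 0, lr_min at the end."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

For a 100-epoch fine-tune this starts at 1e-3, passes through 5e-4 at epoch 50, and decays to 0 at epoch 100, matching the shape of PyTorch's CosineAnnealingLR scheduler.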