GenDataAgent: On-the-fly Dataset Augmentation with Synthetic Data

Authors: Zhiteng Li, Lele Chen, Jerone Andrews, Yunhao Ba, Yulun Zhang, Alice Xiang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across several image classification tasks demonstrate the effectiveness of our approach. We evaluate GenDataAgent in a supervised learning setting, following prior work (Yuan et al., 2023; Sarıyıldız et al., 2023; He et al., 2022). Our experiments cover two scenarios: (i) training a classifier using synthetic data alone, and (ii) using synthetic data to augment real data.
Researcher Affiliation | Collaboration | 1) Shanghai Jiao Tong University, 2) Sony AI
Pseudocode | Yes | Pseudocode for GenDataAgent is presented in Algorithm 1.
Open Source Code | Yes | https://github.com/SonyResearch/GenDataAgent
Open Datasets | Yes | We evaluate GenDataAgent on ImageNet-100 (IN100) (Tian et al., 2020) and five fine-grained datasets: Oxford-IIIT Pets (Parkhi et al., 2012), Flowers-102 (Nilsback & Zisserman, 2008), Birdsnap (Berg et al., 2014), CUB-200-2011 (Wah et al., 2011), and Food-101 (Bossard et al., 2014).
Dataset Splits | Yes | Table 11: Dataset statistics (# Training Samples, # Test Samples).
Hardware Specification | No | The paper mentions "GPU Hours" in Figure 4 as a metric for time taken, but does not specify any particular GPU models (e.g., NVIDIA A100, RTX 3090) or other hardware components used for the experiments.
Software Dependencies | No | The paper mentions models such as Stable Diffusion v1.5, Llama-2, BLIP-2, and CLIP, but does not specify any programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or library versions (e.g., scikit-learn 1.0) that would be needed for replication.
Experiment Setup | Yes | Training hyperparameters for the downstream classifier are listed in Table 12.

Table 12: Training hyperparameters for downstream classification.

                        Pets        CUB         Flowers     Birdsnap    Food        IN100
On-the-fly Iterations   20          20          20          20          20          20
Train Res               224         448         224         224         224         224
Test Res                224         448         224         224         224         224
Epochs                  200         200         200         200         200         200
Batch Size              128 / 8     64 / 8      128 / 8     128 / 8     128 / 8     128 / 8
Optimizer               SGD         SGD         SGD         SGD         SGD         SGD
Learning Rate           0.1         0.2         0.1         0.1         0.1         0.1
LR Decay                Multistep   Multistep   Multistep   Multistep   Multistep   Multistep
Decay Rate              0.2         0.2         0.2         0.2         0.2         0.2
Decay Epochs            50/100/150  50/100/150  50/100/150  50/100/150  50/100/150  50/100/150
Weight Decay            5e-4        5e-4        5e-4        5e-4        5e-4        5e-4
Mixed Precision
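The two evaluation scenarios noted under Research Type (synthetic-only training vs. augmenting real data with synthetic data) can be sketched as follows. This is a minimal illustration, not the authors' code; the dataset contents and the function name `build_training_set` are hypothetical placeholders.

```python
def build_training_set(real, synthetic, scenario):
    """Return the training pool for a given evaluation scenario.

    scenario: "synthetic_only" -> train on generated samples alone;
              "augment"        -> mix synthetic samples into the real set.
    """
    if scenario == "synthetic_only":
        return list(synthetic)
    if scenario == "augment":
        return list(real) + list(synthetic)
    raise ValueError(f"unknown scenario: {scenario}")


# Hypothetical placeholder data for illustration only.
real = ["real_img_1", "real_img_2"]
synthetic = ["syn_img_1"]

print(len(build_training_set(real, synthetic, "synthetic_only")))  # 1
print(len(build_training_set(real, synthetic, "augment")))         # 3
```

In a real pipeline the same split would typically be expressed with dataset concatenation (e.g., combining two image folders) rather than Python lists.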
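The multistep schedule in Table 12 decays the learning rate by a factor of 0.2 at epochs 50, 100, and 150. A small re-implementation of that rule, assuming the common convention that the decay applies from the milestone epoch onward (the base LR of 0.1 matches most datasets in the table; CUB uses 0.2):

```python
import bisect


def multistep_lr(epoch, base_lr=0.1, decay=0.2, milestones=(50, 100, 150)):
    """Learning rate at `epoch` under multistep decay.

    Each milestone that has been reached multiplies the base LR
    by the decay factor once.
    """
    passed = bisect.bisect_right(sorted(milestones), epoch)
    return base_lr * decay ** passed


# Epochs 0-49 use 0.1; 50-99 use 0.02; 100-149 use 0.004; 150+ use 0.0008.
for epoch in (0, 50, 100, 150):
    print(epoch, multistep_lr(epoch))
```

The same schedule is what a framework-level multistep scheduler (e.g., PyTorch's `MultiStepLR` with `gamma=0.2`) would produce.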