GenDataAgent: On-the-fly Dataset Augmentation with Synthetic Data
Authors: Zhiteng Li, Lele Chen, Jerone Andrews, Yunhao Ba, Yulun Zhang, Alice Xiang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across several image classification tasks demonstrate the effectiveness of our approach. We evaluate GenDataAgent in a supervised learning setting, following prior work (Yuan et al., 2023; Sarıyıldız et al., 2023; He et al., 2022). Our experiments cover two scenarios: (i) training a classifier using synthetic data alone, and (ii) using synthetic data to augment real data. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University, 2Sony AI |
| Pseudocode | Yes | Pseudocode for GenDataAgent is presented in Algorithm 1. Algorithm 1: GenDataAgent |
| Open Source Code | Yes | https://github.com/SonyResearch/GenDataAgent |
| Open Datasets | Yes | We evaluate GenDataAgent on ImageNet-100 (IN100) (Tian et al., 2020) and five fine-grained datasets: Oxford-IIIT Pets (Parkhi et al., 2012), Flowers-102 (Nilsback & Zisserman, 2008), Birdsnap (Berg et al., 2014), CUB-200-2011 (Wah et al., 2011), and Food-101 (Bossard et al., 2014). |
| Dataset Splits | Yes | Table 11: Dataset statistics (columns: # Training Samples, # Test Samples). |
| Hardware Specification | No | The paper mentions 'GPU Hours' in Figure 4 as a metric for time taken, but does not specify any particular GPU models (e.g., NVIDIA A100, RTX 3090) or other hardware components used for the experiments. |
| Software Dependencies | No | The paper mentions models like Stable Diffusion v1.5, Llama-2, BLIP-2, and CLIP, but does not specify any programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or library versions (e.g., scikit-learn 1.0) that would be needed for replication. |
| Experiment Setup | Yes | Training hyperparameters for the downstream classifier are listed in Table 12. Table 12: Training hyperparameters for downstream classification. Values are listed in the column order Pets, CUB, Flowers, Birdsnap, Food, IN100. On-the-fly Iterations: 20, 20, 20, 20, 20, 20; Train Res / Test Res: 224 / 224, 448 / 448, 224 / 224, 224 / 224, 224 / 224, 224 / 224; Epochs: 200, 200, 200, 200, 200, 200; Batch Size: 128 × 8, 64 × 8, 128 × 8, 128 × 8, 128 × 8, 128 × 8; Optimizer: SGD, SGD, SGD, SGD, SGD, SGD; Learning Rate: 0.1, 0.2, 0.1, 0.1, 0.1, 0.1; LR Decay: Multistep, Multistep, Multistep, Multistep, Multistep, Multistep; Decay Rate: 0.2, 0.2, 0.2, 0.2, 0.2, 0.2; Decay Epochs: 50/100/150, 50/100/150, 50/100/150, 50/100/150, 50/100/150, 50/100/150; Weight Decay: 5e-4, 5e-4, 5e-4, 5e-4, 5e-4, 5e-4; Mixed Precision |
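
The multistep learning-rate schedule in Table 12 can be illustrated with a short sketch. It uses the reported values (base learning rate 0.1, decay rate 0.2, decay epochs 50/100/150, 200 epochs) but is not taken from the authors' code. The function name `multistep_lr` is hypothetical, and the sketch assumes epochs are 0-indexed and that the decay takes effect at the start of each listed epoch; the summary above specifies neither, nor the training framework.

```python
def multistep_lr(epoch, base_lr=0.1, milestones=(50, 100, 150), gamma=0.2):
    """Learning rate at a given 0-indexed epoch under a multistep schedule.

    Assumption: the rate is multiplied by `gamma` once for every milestone
    that the current epoch has reached.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rate for each of the 200 training epochs listed in Table 12.
schedule = [multistep_lr(e) for e in range(200)]

# Print the rate at the first epoch of each of the four phases.
for e in (0, 50, 100, 150):
    print(f"epoch {e:3d}: lr = {schedule[e]:.6f}")
```

For CUB, Table 12 lists a base learning rate of 0.2, which corresponds to calling `multistep_lr(epoch, base_lr=0.2)` under the same assumptions.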