Improving the Scaling Laws of Synthetic Data with Deliberate Practice
Authors: Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4× fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8× fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work. |
| Researcher Affiliation | Collaboration | ¹FAIR at Meta (Montreal, Paris, and New York City labs), ²Concordia University, ³Mila, ⁴McGill University, ⁵Canada CIFAR AI Chair. |
| Pseudocode | Yes | Algorithm 1 Deliberate Practice for Synthetic Data Generation |
| Open Source Code | No | No explicit statement about providing concrete access to source code for the methodology described in this paper was found. |
| Open Datasets | Yes | Datasets. We validate our framework on two datasets. ImageNet-100 (Tian et al., 2020; Sarıyıldız et al., 2023), a subset of ImageNet-1k (Deng et al., 2009)... Performance is assessed on real ImageNet (held-out) training and validation sets, as well as on ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-A (Hendrycks et al., 2021b) to measure out-of-distribution (OOD) generalization. |
| Dataset Splits | Yes | On ImageNet-100, a subset of ImageNet-1k (Deng et al., 2009) containing 100 classes and 5k validation examples, the real training set (126,689 examples) serves as a held-out test set. We also conduct experiments on ImageNet-1k, using the 50k validation examples to monitor performance and reserving the real training set (1.3 million examples) as a held-out test set. |
| Hardware Specification | No | The paper mentions that "generating a single image with entropy-guidance on an Nvidia H100 takes 1.82× longer than standard vanilla sampling" for a computational cost comparison, but does not specify the GPU model used for the main training experiments. It states "For ImageNet-100, we train on 4 nodes, each with 8 GPUs, with a batch size of 64. For ImageNet-1k, we train on 4 nodes, each with 8 GPUs, with a batch size of 128." without detailing the specific GPU model. |
| Software Dependencies | No | The paper mentions software components like the "Warmup-Stable-Decay (WSD) learning rate scheduler (Hu et al., 2024)", the "AdamW optimizer", "Mixup", and "CutMix", but does not provide specific version numbers for underlying libraries (e.g., PyTorch, TensorFlow) or programming language versions (e.g., Python 3.x). |
| Experiment Setup | Yes | For ImageNet-100, we train on 4 nodes, each with 8 GPUs, with a batch size of 64. For ImageNet-1k, we train on 4 nodes, each with 8 GPUs, with a batch size of 128. All experiments in this section use a fixed and controlled setup. We train the models for 100k and 50k iterations for ImageNet-1k and ImageNet-100, respectively. For all experiments, the initial 10% of iterations use linear warmup and the last 20% use cosine-annealing cool-down; the intermediate steps use a constant learning rate. For ImageNet-100, the learning rate is 0.003 with an EMA momentum of 0.001. For ImageNet-1k, the learning rate is set to 0.0016 with an EMA momentum of 0.001. We also use label smoothing with a value of 0.11, Mixup with an alpha of 0.5, and CutMix with an alpha of 1.0. Furthermore, we use the AdamW optimizer. |
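The experiment-setup row describes a Warmup-Stable-Decay-style learning-rate schedule (linear warmup for the first 10% of iterations, a constant plateau, then cosine-annealing cool-down over the last 20%). A minimal sketch of that schedule, assuming the cool-down anneals to zero (the function name and signature are illustrative, not from the paper):

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_frac=0.10, cooldown_frac=0.20):
    """WSD-style schedule: linear warmup -> constant -> cosine cool-down."""
    warmup_steps = int(total_steps * warmup_frac)
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step < warmup_steps:
        # Linear warmup from ~0 up to peak_lr over the first 10% of steps.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Constant-learning-rate plateau for the intermediate steps.
        return peak_lr
    # Cosine annealing from peak_lr toward 0 over the final 20% of steps.
    progress = (step - cooldown_start) / (total_steps - cooldown_start)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with the ImageNet-100 settings (50k iterations, peak learning rate 0.003), `lr_schedule(25_000, 50_000, 0.003)` falls in the constant phase and returns 0.003, while steps past 40k decay along the cosine curve.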