Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Authors: Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano

ICML 2025

Reproducibility assessment (columns: Variable | Result | LLM Response)
Research Type Experimental We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4× fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8× fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
Researcher Affiliation Collaboration ¹FAIR at Meta (Montreal, Paris, and New York City labs), ²Concordia University, ³Mila, ⁴McGill University, ⁵Canada CIFAR AI Chair.
Pseudocode Yes Algorithm 1 Deliberate Practice for Synthetic Data Generation
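Algorithm 1 is only cited by name above; as a minimal sketch of the core idea the paper reports (training on challenging, informative examples selected by the learner's prediction entropy), the selection step might look like the following. The function names, the `predict` interface, and the candidate set are hypothetical, not taken from the paper:

```python
import math

def entropy(probs):
    """Predictive entropy of a probability vector; higher entropy means the
    learner is more uncertain, i.e. the example is more challenging."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_challenging(candidates, predict, n_keep):
    """Hypothetical selection step: over-generate synthetic candidates, score
    each by the learner's predictive entropy, and keep the n_keep hardest."""
    scored = sorted(candidates, key=lambda x: entropy(predict(x)), reverse=True)
    return scored[:n_keep]
```

In a full deliberate-practice loop this selection would alternate with training the learner on the retained samples, so each round targets the current model's weaknesses.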
Open Source Code No No explicit statement about providing concrete access to source code for the methodology described in this paper was found.
Open Datasets Yes Datasets. We validate our framework on two datasets. ImageNet-100 (Tian et al., 2020; Sarıyıldız et al., 2023), a subset of ImageNet-1k (Deng et al., 2009)... Performance is assessed on real ImageNet (held-out) training and validation sets, as well as on ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-R (Hendrycks et al., 2021a), and ImageNet-A (Hendrycks et al., 2021b) to measure out-of-distribution (OOD) generalization.
Dataset Splits Yes On ImageNet-100, a subset of ImageNet-1k (Deng et al., 2009) containing 100 classes and 5k validation examples, the real training set (126,689 examples) serves as a held-out test set. We also conduct experiments on ImageNet-1k, using the 50k validation examples to monitor performance and reserving the real training set (1.3 million examples) as a held-out test set.
Hardware Specification No The paper mentions that "generating a single image with entropy-guidance on an Nvidia H100 takes 1.82× longer than standard vanilla sampling" for a computational cost comparison, but does not specify the GPU model used for the main training experiments. It states "For ImageNet-100, we train on 4 nodes, each with 8 GPUs, with a batch size of 64. For ImageNet-1k, we train on 4 nodes, each with 8 GPUs, with a batch size of 128." without detailing the specific GPU model.
Software Dependencies No The paper mentions software components such as the "Warmup-Stable-Decay (WSD) learning rate scheduler (Hu et al., 2024)", the "AdamW optimizer", "Mixup", and "CutMix", but does not provide specific version numbers for underlying libraries (e.g., PyTorch, TensorFlow) or programming-language versions (e.g., Python 3.x).
Experiment Setup Yes For ImageNet-100, we train on 4 nodes, each with 8 GPUs, with a batch size of 64. For ImageNet-1k, we train on 4 nodes, each with 8 GPUs, with a batch size of 128. All experiments in this section use a fixed and controlled setup. We train the models for 100k and 50k iterations for ImageNet-1k and ImageNet-100, respectively. For all experiments, the initial 10% of iterations use linear warmup and the last 20% use a cosine-annealing cool-down; the intermediate steps use a constant learning rate. For ImageNet-100, the learning rate is 0.003 with an EMA momentum of 0.001. For ImageNet-1k, the learning rate is set to 0.0016 with an EMA momentum of 0.001. We also use label smoothing with a value of 0.11, Mixup with an alpha of 0.5, and CutMix with an alpha of 1.0. Furthermore, we use the AdamW optimizer.
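The learning-rate schedule quoted above (linear warmup over the first 10% of iterations, a constant plateau, then a cosine-annealing cool-down over the final 20%) can be sketched as a small helper. The function name and the choice to anneal all the way to zero are assumptions for illustration, not details from the paper:

```python
import math

def lr_at(step, total_steps, peak_lr):
    """Piecewise schedule matching the described setup: linear warmup for the
    first 10% of iterations, constant peak LR in the middle, and a cosine
    cool-down over the final 20% (annealing to 0 is an assumption)."""
    warmup_end = int(0.1 * total_steps)
    decay_start = int(0.8 * total_steps)
    if step < warmup_end:
        # Linear warmup from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_end
    if step < decay_start:
        # Constant plateau at the peak learning rate.
        return peak_lr
    # Cosine anneal from peak_lr toward 0 over the last 20% of steps.
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For the ImageNet-100 run this would be called with `peak_lr=0.003` and `total_steps=50_000`; for ImageNet-1k, `peak_lr=0.0016` and `total_steps=100_000`.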