Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification
Authors: Yunzhen Feng, Elvis Dohmatob, Pu Yang, François Charton, Julia Kempe
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with practical tasks, computing matrix eigenvalues with transformers and news summarization with LLMs, which both exhibit model collapse when trained on generated data, and show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse, and that our proposed proxy measure strongly correlates with performance. |
| Researcher Affiliation | Collaboration | Yunzhen Feng (1,2), Elvis Dohmatob (1,3,4), Pu Yang (5), François Charton (1), Julia Kempe (1,2); 1: Meta FAIR, 2: New York University, 3: Concordia University, 4: Mila, 5: Peking University |
| Pseudocode | No | The paper describes methods and theoretical insights through mathematical formulations and textual descriptions, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'We leverage the code base provided by Charton (2022) at https://github.com/facebookresearch/LAWT under the license CC BY-NC 4.0.' and 'We leverage the official implementation in Huggingface 2 for training, under the license Apache 2.0.' However, these refer to code from prior work or third-party libraries, not the authors' own implementation for the methodology described in this paper. |
| Open Datasets | Yes | We utilize the English summarization subset of the XLSUM dataset (Hasan et al., 2021), the largest publicly available summarization dataset, consisting of 307,000 training samples and 11,500 test samples. |
| Dataset Splits | Yes | We utilize the English summarization subset of the XLSUM dataset (Hasan et al., 2021), the largest publicly available summarization dataset, consisting of 307,000 training samples and 11,500 test samples. |
| Hardware Specification | Yes | We leverage a V100 GPU with 32GB of memory for all experiments involving linear algebra. |
| Software Dependencies | No | The paper mentions leveraging 'the official implementation in Huggingface' for training and using 'Adam optimizer' without specifying any version numbers for these software components or libraries. |
| Experiment Setup | Yes | The synthesized data generator is trained on a limited sample of 200,000 examples with Adam for 65 epochs. [...] We train sequence-to-sequence transformers (Vaswani et al., 2017), with 4 layers in the encoder, and one in the decoder, 512 dimensions and 8 attention heads, to minimize a cross-entropy loss, using the Adam optimizer (Kingma & Ba, 2014), with a fixed learning rate of 5×10⁻⁵, after an initial linear warm-up phase over the first 10,000 optimization steps. The model is trained for 400 epochs before overfitting. |
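The learning-rate schedule quoted above (a fixed rate of 5×10⁻⁵ reached after a linear warm-up over the first 10,000 optimization steps) can be sketched in plain Python. This is a minimal illustration of the schedule as described, not the authors' implementation; the function name and defaults are chosen here for clarity.

```python
def learning_rate(step: int, peak_lr: float = 5e-5, warmup_steps: int = 10_000) -> float:
    """Linear warm-up from 0 to peak_lr over warmup_steps, then constant.

    Sketch of the schedule described in the Experiment Setup row:
    a fixed learning rate of 5e-5 after a 10,000-step linear warm-up.
    """
    if step < warmup_steps:
        # Ramp up proportionally to the current step during warm-up.
        return peak_lr * step / warmup_steps
    # After warm-up the rate stays fixed at peak_lr.
    return peak_lr
```

For example, at step 5,000 the rate is halfway through warm-up (2.5×10⁻⁵), and from step 10,000 onward it stays at 5×10⁻⁵.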