Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification
Authors: Yunzhen Feng, Elvis Dohmatob, Pu Yang, François Charton, Julia Kempe
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with practical tasks, computing matrix eigenvalues with transformers and news summarization with LLMs, which both exhibit model collapse when trained on generated data, and show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse, and that our proposed proxy measure strongly correlates with performance. |
| Researcher Affiliation | Collaboration | Yunzhen Feng (1,2), Elvis Dohmatob (1,3,4), Pu Yang (5), François Charton (1), Julia Kempe (1,2); 1: Meta FAIR, 2: New York University, 3: Concordia University, 4: Mila, 5: Peking University |
| Pseudocode | No | The paper describes methods and theoretical insights through mathematical formulations and textual descriptions, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'We leverage the code base provided by Charton (2022) at https://github.com/facebookresearch/LAWT under the license CC BY-NC 4.0.' and 'We leverage the official implementation in Huggingface 2 for training, under the license Apache 2.0.' However, these refer to code from prior work or third-party libraries, not the authors' own implementation for the methodology described in this paper. |
| Open Datasets | Yes | We utilize the English summarization subset of the XLSUM dataset (Hasan et al., 2021), the largest publicly available summarization dataset, consisting of 307,000 training samples and 11,500 test samples. |
| Dataset Splits | Yes | We utilize the English summarization subset of the XLSUM dataset (Hasan et al., 2021), the largest publicly available summarization dataset, consisting of 307,000 training samples and 11,500 test samples. |
| Hardware Specification | Yes | We leverage a V100 GPU with 32GB of memory for all experiments involving linear algebra. |
| Software Dependencies | No | The paper mentions leveraging 'the official implementation in Huggingface' for training and using 'Adam optimizer' without specifying any version numbers for these software components or libraries. |
| Experiment Setup | Yes | The synthesized data generator is trained on a limited sample of 200,000 examples with Adam for 65 epochs. [...] We train sequence-to-sequence transformers (Vaswani et al., 2017), with 4 layers in the encoder, and one in the decoder, 512 dimensions and 8 attention heads, to minimize a cross-entropy loss, using the Adam optimizer (Kingma & Ba, 2014), with a fixed learning rate of 5×10⁻⁵, after an initial linear warm-up phase over the first 10,000 optimization steps. The model is trained for 400 epochs before overfitting. |
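The learning-rate schedule quoted above (a fixed rate of 5×10⁻⁵ reached after a linear warm-up over the first 10,000 optimization steps) can be sketched in plain Python. This is a minimal illustration of the schedule as described, not the authors' implementation; the function name and defaults are chosen here for clarity.

```python
def learning_rate(step: int, peak_lr: float = 5e-5, warmup_steps: int = 10_000) -> float:
    """Linear warm-up from 0 to peak_lr over warmup_steps, then constant.

    Sketch of the schedule described in the Experiment Setup row:
    a fixed learning rate of 5e-5 after a 10,000-step linear warm-up.
    """
    if step < warmup_steps:
        # Ramp up proportionally to the current step during warm-up.
        return peak_lr * step / warmup_steps
    # After warm-up the rate stays fixed at peak_lr.
    return peak_lr
```

For example, at step 5,000 the rate is halfway through warm-up (2.5×10⁻⁵), and from step 10,000 onward it stays at 5×10⁻⁵.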