Strong Model Collapse

Authors: Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, Julia Kempe

Venue: ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical findings are empirically verified through experiments on language models and neural networks for images.
Researcher Affiliation | Collaboration | Meta FAIR, Concordia University, Mila, NYU, UCLA. Work done while interning at Meta. Correspondence to EMAIL.
Pseudocode | No | The paper describes its methods and analyses through mathematical formulations and textual descriptions; no explicit pseudocode or algorithm blocks are present.
Open Source Code | No | The paper notes that the generation process for the Babi Stories dataset is detailed in the GitHub repository of Zhang et al. (2024a), but does not state that the authors' own implementation code for the methodology described in this paper is available.
Open Datasets | Yes | Toy settings include a random projections model on Gaussian data and shallow networks fully trained on the MNIST dataset (Deng, 2012). The realistic setting uses GPT-2 models trained on Babi Stories (Zhang et al., 2024a), a reproduction of TinyStories (Eldan & Li, 2023) generated with the Mixtral-8x7B open language model (Jiang et al., 2024).
Dataset Splits | Yes | The dataset comprises a training set of 2,200,000 stories and a validation set of 22,000 stories, created by prompting the Mixtral-8x7B model. [...] A validation set is used to select the best checkpoint, and evaluation is conducted on the test set using the clean labels.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or processor types used for running its experiments.
Software Dependencies | No | The paper mentions using a GPT-2-small model but does not specify software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | The regularization parameter λ is set to a very small value (10⁻⁸). The two-layer neural networks are trained using stochastic gradient descent (SGD) with a batch size of 128 and a learning rate of 0.1, for 400 epochs to fully converge. During training, a learning rate of 5×10⁻³, a dropout rate of 0.05, L2 weight decay of 0.1, and a warm-up phase of 2,000 iterations were applied.
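The near-ridgeless setting above (λ = 10⁻⁸) can be sketched as a ridge regression fit. This is a minimal illustration, not the paper's implementation: the Gaussian data, problem dimensions, and the `ridge_fit` helper are assumptions chosen for the example.

```python
import numpy as np

def ridge_fit(X, y, lam=1e-8):
    """Solve ridge regression: w = (X^T X + n*lam*I)^{-1} X^T y.

    lam=1e-8 mirrors the near-ridgeless regularization reported
    in the experiment setup; the data below is synthetic.
    """
    n, d = X.shape
    A = X.T @ X + n * lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Synthetic Gaussian data (illustrative assumption).
rng = np.random.default_rng(0)
n, d = 200, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = ridge_fit(X, y)
print(float(np.mean((w_hat - w_true) ** 2)))  # small recovery error
```

With λ this small, the estimator is essentially the ordinary least-squares solution; the regularizer only stabilizes the linear solve.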