Strong Model Collapse
Authors: Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, Julia Kempe
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical findings are empirically verified through experiments on language models and neural networks for images. |
| Researcher Affiliation | Collaboration | Meta FAIR; Concordia University; Mila; NYU; UCLA. Work done while interning at Meta. Correspondence to EMAIL |
| Pseudocode | No | The paper describes methods and analyses through mathematical formulations and textual descriptions. No explicit pseudocode or algorithm blocks are present. |
| Open Source Code | No | The paper mentions that the generation process for the Babi Stories dataset is detailed in the GitHub repository of Zhang et al. (2024a), but it does not state that the authors' own implementation code for the methodology described in this paper is available. |
| Open Datasets | Yes | Toy settings, including a random projections model on Gaussian data, and shallow networks fully trained on the MNIST dataset (Deng, 2012). Realistic setting of GPT-2 models trained on Babi Stories (Zhang et al., 2024a), a reproduction of Tiny Stories (Eldan & Li, 2023) using the Mixtral-8x7B open language model (Jiang et al., 2024). |
| Dataset Splits | Yes | The dataset comprises a training set of 2,200,000 stories and a validation set of 22,000 stories, created by prompting the Mixtral-8x7B model. [...] A validation set is used to select the best checkpoint, and evaluation is conducted on the test set using the clean labels. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or processor types used for running its experiments. |
| Software Dependencies | No | The paper mentions using a 'GPT-2-small model' but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The regularization parameter λ is set to a very small value (10⁻⁸). The two-layer neural networks are trained using stochastic gradient descent (SGD) with a batch size of 128 and a learning rate of 0.1. The models are trained for 400 epochs to fully converge. During training, we applied a learning rate of 5×10⁻³, a dropout rate of 0.05, L2 weight decay of 0.1, and a warm-up phase of 2,000 iterations. |
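The near-zero regularization reported above (λ = 10⁻⁸) effectively makes the ridge estimator behave like ordinary least squares. A minimal sketch of that setting, using synthetic Gaussian data as a stand-in (the data shapes and generator here are illustrative assumptions, not the paper's actual experiments):

```python
import numpy as np

# Hypothetical Gaussian regression data standing in for the paper's setup.
rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Ridge estimator with the near-zero lambda reported in the table.
lam = 1e-8
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# With lambda this small, the solution is numerically indistinguishable
# from ordinary least squares.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_ridge, w_ols, atol=1e-4))
```

This is only a sketch of the regularization regime the table describes; the paper's actual experiments additionally involve random projections, MNIST-trained shallow networks, and GPT-2 training with the listed hyperparameters.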