How to Synthesize Text Data without Model Collapse?
Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover distributional shift phenomenon and over-concentration of n-gram features. ... We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves model performance. |
| Researcher Affiliation | Collaboration | 1LUMIA Lab, Shanghai Jiao Tong University 2State Key Laboratory of General Artificial Intelligence, BIGAI 3Institute for Artificial Intelligence, Peking University 4Department of Electronic Engineering, Tsinghua University 5Shanghai Artificial Intelligence Laboratory. Correspondence to: Zhouhan Lin <EMAIL>, Zilong Zheng <EMAIL>, Bowen Zhou <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Token-level Editing |
| Open Source Code | Yes | Code repository available at https://github.com/Xuekai-Zhu/toedit. |
| Open Datasets | Yes | We use Dolma (Soldaini et al., 2024) as source human-produced data. We use Cosmopedia (Ben Allal et al., 2024) as the source synthetic data, which is distilled from Mixtral-8x7B-Instruct-v0.1 (Jiang et al., 2024). To ensure rigorous validation and prevent data leakage, we construct a three-tier evaluation: (1) the Paloma benchmark (Magnusson et al., 2023)... (2) comprehensive PPL evaluation across 22 subdomains from the Pile (Gao et al., 2020b)... For supervised fine-tuning, we fine-tune Llama-3-8B on instruction tuning and code reasoning tasks, evaluating on 9 downstream tasks. As for general instruction tuning tasks, we adopt instruction tuning datasets from (Xia et al., 2024), including CoT (Wei et al., 2022), FLAN v2 (Longpre et al., 2023), and OpenAssistant 1 (Köpf et al., 2023). As for code-related reasoning tasks, we utilize OSS-Instruct-75K and Evol-Instruct-110K. |
| Dataset Splits | No | The paper mentions using data mixtures for training and evaluating on external benchmarks, but does not explicitly provide training/validation/test splits for the primary datasets used in their experiments. For example: "We pre-train GPT-2 (Radford et al., 2019) and OLMo (Groeneveld et al., 2024) from scratch, using data mixtures containing 50B tokens each." and "For pre-training, we pre-train the 1B OLMo model (Groeneveld et al., 2024) from scratch using Dolma-sampled V6 (6B tokens) and evaluate on 8 general tasks." |
| Hardware Specification | Yes | We integrate the fast inference engine vLLM (Kwon et al., 2023), allowing the entire data editing process to be completed on a single 4090 GPU. |
| Software Dependencies | No | The paper mentions several software components and frameworks but does not specify version numbers for any of them. For example: "We integrate the fast inference engine vLLM (Kwon et al., 2023)... For GPT-2, we employed the official FSDP (Fully Sharded Data Parallel) framework provided by Torch for training. For OLMo, we used the official open-source computational code, which also incorporates the FSDP framework alongside Flash Attention for acceleration. ...For LLaMA, we adopted the LLaMA-Factory framework to carry out the continual pre-training process." |
| Experiment Setup | Yes | The modification probability is set to p = 0.99. This means that we resample tokens in positions where the probability exceeds p, and the resampling is based on the conditional probability given the preceding context. ... We use top-k as the sampling strategy with k = 8. ...To balance model performance and data distribution preservation, we set p = 0.99 as threshold for our experiments. ...Therefore, we set k = 8 in our experiments. And a detailed case for token editing is provided in Table 12. |
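The token-level editing procedure described in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `token_edit`, `prob_fn`, and the toy probability function are assumptions, and in the paper's setup the conditional probabilities would come from a prior language model served with vLLM, with p = 0.99 and k = 8.

```python
import random

def token_edit(tokens, prob_fn, p=0.99, k=8, rng=None):
    """Sketch of token-level editing: positions where the prior model's
    conditional probability exceeds the threshold p are resampled from
    the top-k candidates; all other tokens are kept unchanged."""
    rng = rng or random.Random(0)
    edited = []
    for i, tok in enumerate(tokens):
        # Condition on the original preceding tokens (an assumption about
        # the exact context; the paper says "given the preceding context").
        probs = prob_fn(tuple(tokens[:i]))  # dict: candidate token -> probability
        if probs.get(tok, 0.0) > p:
            # Over-confident position: resample among the k most likely
            # candidates, weighted by their conditional probabilities.
            topk = sorted(probs, key=probs.get, reverse=True)[:k]
            tok = rng.choices(topk, weights=[probs[t] for t in topk], k=1)[0]
        edited.append(tok)
    return edited
```

With a high threshold such as p = 0.99, only positions the prior model considers nearly deterministic are touched, which matches the paper's stated goal of balancing performance gains against preserving the data distribution.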