Memorization Sinks: Isolating Memorization during LLM Training
Authors: Gaurav Rohit Ghosal, Pratyush Maini, Aditi Raghunathan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train 360M and 1.7B SmolLM models (Allal et al., 2025) on the SlimPajama dataset (Shen et al., 2024), with repeated sequences drawn from TinyStories. Standard training leads to strong memorization, where repeated sequences show much lower loss than held-out ones. With MemSinks, dropping the memorization components closes over 50% of this loss gap, mitigating memorization. Furthermore, MemSinks (without memorization components) matches the validation loss of standard training and significantly outperforms a deduplication baseline, preserving the benefits of repeated data for generalization. This provides a proof-of-concept that MemSinks can disentangle memorization from generalization in realistic settings (Section 5.2). |
| Researcher Affiliation | Collaboration | Department of Machine Learning, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; DatologyAI, Redwood City, CA, USA. Correspondence to: Gaurav Ghosal <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations within the main text and appendices, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our code at http://github.com/grghosal/MemSinks. |
| Open Datasets | Yes | We train 360M and 1.7B SmolLM models (Allal et al., 2025) on the SlimPajama dataset (Shen et al., 2024), with repeated sequences drawn from TinyStories. We conduct our experiments in a controlled setting using a subset of the TinyStories dataset (Eldan & Li, 2023). |
| Dataset Splits | No | The paper mentions using a 'validation loss' and comparing against 'held-out ones' and 'validation TinyStories data', implying the existence of evaluation sets. However, it does not provide specific details on how the datasets were formally split into training, validation, and test sets (e.g., percentages, sample counts, or explicit splitting methodologies). |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions building its implementation off of the 'LitGPT library (AI, 2023)' but does not provide specific version numbers for this or any other software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train a GPT-2-Medium-like architecture with embedding dimension 1024 and a 4x expansion in the MLP layer. With 24 layers, the resulting model has approximately 344M parameters. We set the training hyperparameters as shown in Table 1: Max Learning Rate ∈ {6e-5, 6e-4, 6e-3}; Weight Decay ∈ {1e-5, 1e-3, 1e-1}; Min Learning Rate = Max Learning Rate / 10; LR Decay Steps = Total Training Steps. For MemSinks, we tuned over the hyperparameter choices given in Table 3: g ∈ {0.1, 0.3, 0.5, 0.7}; p ∈ {0.1, 0.3, 0.5}. |
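The Experiment Setup row amounts to a small hyperparameter grid. The sketch below enumerates that sweep, assuming the Table 1 and Table 3 grids are crossed exhaustively (the paper does not state the exact search procedure); `make_configs` is a hypothetical helper, not code from the authors' repository.

```python
from itertools import product

# Grids reported in the paper: Table 1 (optimizer) and Table 3 (MemSinks g, p).
max_lrs = [6e-5, 6e-4, 6e-3]
weight_decays = [1e-5, 1e-3, 1e-1]
gs = [0.1, 0.3, 0.5, 0.7]
ps = [0.1, 0.3, 0.5]

def make_configs():
    """Enumerate every combination in the sweep (hypothetical helper)."""
    return [
        {
            "max_lr": lr,
            "min_lr": lr / 10,  # Min Learning Rate is fixed at Max Learning Rate / 10
            "weight_decay": wd,
            "g": g,
            "p": p,
        }
        for lr, wd, g, p in product(max_lrs, weight_decays, gs, ps)
    ]

configs = make_configs()
print(len(configs))  # 3 * 3 * 4 * 3 = 108 candidate runs
```

Because the minimum learning rate and the LR decay schedule are derived rather than swept, the grid has only 3 × 3 × 4 × 3 = 108 candidate configurations.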
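The key operation in the Research Type row, "dropping the memorization components" at evaluation time, can be illustrated as masking out a designated fraction of hidden units. This is only one plausible reading: which units serve as sinks, the trailing-unit layout, and the helpers `sink_mask` and `apply_mask` are all assumptions of this sketch, not the paper's actual implementation.

```python
def sink_mask(hidden_dim, sink_frac, keep_sinks=True):
    """Binary mask over MLP hidden units.

    The trailing sink_frac fraction of units are treated as the
    'memorization sinks' (an assumption of this sketch). With
    keep_sinks=False, those units are zeroed, as at evaluation time.
    """
    n_sink = int(hidden_dim * sink_frac)
    mask = [1.0] * hidden_dim
    if not keep_sinks:
        for i in range(hidden_dim - n_sink, hidden_dim):
            mask[i] = 0.0  # drop the sink units
    return mask

def apply_mask(activations, mask):
    """Element-wise product, zeroing the sink units for each example."""
    return [[a * m for a, m in zip(row, mask)] for row in activations]

# Toy activations: a batch of 2 examples with hidden dimension 8.
h = [[1.0] * 8, [2.0] * 8]
h_eval = apply_mask(h, sink_mask(8, 0.25, keep_sinks=False))
# With sink_frac=0.25, the last 2 of 8 units are zeroed for every example.
```

The idea, per the paper's framing, is that memorization of repeated sequences is steered into the sink units during training, so zeroing them at evaluation removes memorization while the remaining units preserve generalization.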