Memorization Sinks: Isolating Memorization during LLM Training
Authors: Gaurav Rohit Ghosal, Pratyush Maini, Aditi Raghunathan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train 360M and 1.7B SmolLM models (Allal et al., 2025) on the SlimPajama dataset (Shen et al., 2024), with repeated sequences drawn from TinyStories. Standard training leads to strong memorization, where repeated sequences show much lower loss than held-out ones. With MemSinks, dropping the memorization components closes over 50% of this loss gap, mitigating memorization. Furthermore, MemSinks (without memorization components) matches the validation loss of standard training and significantly outperforms a deduplication baseline, preserving the benefits of repeated data for generalization. This provides a proof-of-concept that MemSinks can disentangle memorization from generalization in realistic settings (Section 5.2). |
| Researcher Affiliation | Collaboration | Department of Machine Learning, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA; DatologyAI, Redwood City, CA, USA. Correspondence to: Gaurav Ghosal <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations within the main text and appendices, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our code at http://github.com/grghosal/MemSinks. |
| Open Datasets | Yes | We train 360M and 1.7B SmolLM models (Allal et al., 2025) on the SlimPajama dataset (Shen et al., 2024), with repeated sequences drawn from TinyStories. We conduct our experiments in a controlled setting using a subset of the TinyStories dataset (Eldan & Li, 2023). |
| Dataset Splits | No | The paper mentions using a 'validation loss' and comparing against 'held-out ones' and 'validation TinyStories data', implying the existence of evaluation sets. However, it does not provide specific details on how the datasets were formally split into training, validation, and test sets (e.g., percentages, sample counts, or explicit splitting methodologies). |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions building its implementation off of the 'LitGPT library (AI, 2023)' but does not provide specific version numbers for this or any other software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train a GPT-2-Medium-like architecture with embedding dimension 1024 and a 4x expansion in the MLP layer. With 24 layers, the resulting model has approximately 344M parameters. We set the training hyperparameters as shown in Table 1: Max Learning Rate ∈ {6e-5, 6e-4, 6e-3}; Weight Decay ∈ {1e-5, 1e-3, 1e-1}; Min Learning Rate = Max Learning Rate / 10; LR Decay Steps = Total Training Steps. For MemSinks, we tuned over the hyperparameter choices given in Table 3: g ∈ {0.1, 0.3, 0.5, 0.7}; p ∈ {0.1, 0.3, 0.5}. |
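The Experiment Setup row amounts to a small hyperparameter grid. The sketch below enumerates that sweep, assuming the Table 1 and Table 3 grids are crossed exhaustively (the paper does not state the exact search procedure); `make_configs` is a hypothetical helper, not code from the authors' repository.

```python
from itertools import product

# Grids reported in the paper: Table 1 (optimizer) and Table 3 (MemSinks g, p).
max_lrs = [6e-5, 6e-4, 6e-3]
weight_decays = [1e-5, 1e-3, 1e-1]
gs = [0.1, 0.3, 0.5, 0.7]
ps = [0.1, 0.3, 0.5]

def make_configs():
    """Enumerate every combination in the sweep (hypothetical helper)."""
    return [
        {
            "max_lr": lr,
            "min_lr": lr / 10,  # Min Learning Rate is fixed at Max Learning Rate / 10
            "weight_decay": wd,
            "g": g,
            "p": p,
        }
        for lr, wd, g, p in product(max_lrs, weight_decays, gs, ps)
    ]

configs = make_configs()
print(len(configs))  # 3 * 3 * 4 * 3 = 108 candidate runs
```

Because the minimum learning rate and the LR decay schedule are derived rather than swept, the grid has only 3 × 3 × 4 × 3 = 108 candidate configurations.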
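The key operation in the Research Type row, "dropping the memorization components" at evaluation time, can be illustrated as masking out a designated fraction of hidden units. This is only one plausible reading: which units serve as sinks, the trailing-unit layout, and the helpers `sink_mask` and `apply_mask` are all assumptions of this sketch, not the paper's actual implementation.

```python
def sink_mask(hidden_dim, sink_frac, keep_sinks=True):
    """Binary mask over MLP hidden units.

    The trailing sink_frac fraction of units are treated as the
    'memorization sinks' (an assumption of this sketch). With
    keep_sinks=False, those units are zeroed, as at evaluation time.
    """
    n_sink = int(hidden_dim * sink_frac)
    mask = [1.0] * hidden_dim
    if not keep_sinks:
        for i in range(hidden_dim - n_sink, hidden_dim):
            mask[i] = 0.0  # drop the sink units
    return mask

def apply_mask(activations, mask):
    """Element-wise product, zeroing the sink units for each example."""
    return [[a * m for a, m in zip(row, mask)] for row in activations]

# Toy activations: a batch of 2 examples with hidden dimension 8.
h = [[1.0] * 8, [2.0] * 8]
h_eval = apply_mask(h, sink_mask(8, 0.25, keep_sinks=False))
# With sink_frac=0.25, the last 2 of 8 units are zeroed for every example.
```

The idea, per the paper's framing, is that memorization of repeated sequences is steered into the sink units during training, so zeroing them at evaluation removes memorization while the remaining units preserve generalization.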