Understanding and Mitigating Memorization in Diffusion Models for Tabular Data
Authors: Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, Jing Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models … |
| Researcher Affiliation | Academia | 1Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, USA 2Department of Computer Science & Engineering, Texas A&M University, College Station, USA 3Department of Biochemistry, Case Western Reserve University, Cleveland, USA 4Center for RNA Science and Therapeutics, Case Western Reserve University, Cleveland, USA 5Department of Biomedical Engineering, Case Western Reserve University, Cleveland, USA. Correspondence to: Jing Li <EMAIL>. |
| Pseudocode | Yes | The pseudo-code of our proposed algorithm is in Appendix C. |
| Open Source Code | Yes | Our code is available at https://github.com/fangzy96/TabCutMix. |
| Open Datasets | Yes | We use four real-world tabular datasets containing both numerical and categorical features: Adult, Default, Shoppers, and Magic. The detailed descriptions and overall statistics of these datasets are provided in Appendix D.1. https://archive.ics.uci.edu/dataset/2/adult https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope |
| Dataset Splits | Yes | For the Adult dataset, which has an official test set, we directly use it for testing, while the training set is split into training and validation sets in a ratio of 8:1. For the remaining datasets, the data is split into training, validation, and test sets with a ratio of 8:1:1, ensuring consistent splitting with a fixed random seed. |
| Hardware Specification | No | This work also made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. |
| Software Dependencies | No | We implement TabCutMix and all the baseline methods with PyTorch. All the methods are optimized with the Adam optimizer. |
| Experiment Setup | Yes | The following lists the hyperparameter search space for the XGBoost classifier applied during the MLE tasks, where grid search is used to determine the best parameter configurations: number of estimators: {10, 50, 100}; minimum child weight: {5, 10, 20}; maximum tree depth: {1, 10}; gamma: {0.0, 1.0}. |
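The paper describes TabCutMix as exchanging randomly selected feature segments between random same-class training sample pairs. A minimal NumPy sketch of that idea is below; the function name `tabcutmix`, the per-pair column-count draw, and allowing a sample to be paired with itself are assumptions for illustration, not the authors' exact implementation (their pseudo-code is in Appendix C of the paper).

```python
import numpy as np

def tabcutmix(X, y, rng=None):
    """Hypothetical sketch of TabCutMix-style augmentation: for each row,
    pick a random partner with the same class label and copy a random
    subset of that partner's feature columns into the row."""
    rng = np.random.default_rng(rng)
    X_aug = X.copy()
    n, d = X.shape
    for i in range(n):
        # candidate partners sharing the same class label (may include i itself)
        same = np.flatnonzero(y == y[i])
        j = rng.choice(same)
        # draw how many feature columns to exchange, then which ones
        k = rng.integers(1, d)
        cols = rng.choice(d, size=k, replace=False)
        X_aug[i, cols] = X[j, cols]
    return X_aug, y

# Tiny usage example on a 4-sample, 3-feature toy matrix
X = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([0, 0, 1, 1])
X_new, y_new = tabcutmix(X, y, rng=0)
```

Because columns are only copied between rows of the same class, every augmented value still comes from a real same-class training sample, which is what keeps the augmentation label-preserving.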
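The reported 8:1:1 train/validation/test split with a fixed random seed can be sketched as a seeded index permutation; the seed value 42 and the helper name `split_811` are assumptions (the paper only states that the seed is fixed).

```python
import numpy as np

def split_811(n, seed=42):
    """Sketch of an 8:1:1 train/val/test split over n samples,
    reproducible via a fixed seed (seed value is an assumption)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

tr, va, te = split_811(1000)
```

Re-running `split_811` with the same seed yields identical index sets, which is what makes the splits consistent across methods.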
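The grid search over the stated XGBoost hyperparameter space is a plain Cartesian product; the sketch below enumerates the 36 configurations using only the standard library. In practice each configuration would be passed to something like `xgboost.XGBClassifier(**config)` and scored on the validation split (that fitting step is omitted here to keep the example dependency-free).

```python
from itertools import product

# Hyperparameter search space for the XGBoost classifier in the MLE tasks,
# as listed in the paper's experiment setup.
grid = {
    "n_estimators": [10, 50, 100],
    "min_child_weight": [5, 10, 20],
    "max_depth": [1, 10],
    "gamma": [0.0, 1.0],
}

# Exhaustive grid: every combination of one value per hyperparameter.
keys = list(grid)
configs = [dict(zip(keys, vals)) for vals in product(*grid.values())]
print(len(configs))  # 3 * 3 * 2 * 2 = 36 configurations
```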