Understanding and Mitigating Memorization in Diffusion Models for Tabular Data
Authors: Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, Jing Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models … |
| Researcher Affiliation | Academia | 1Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, USA 2Department of Computer Science & Engineering, Texas A&M University, College Station, USA 3Department of Biochemistry, Case Western Reserve University, Cleveland, USA 4Center for RNA Science and Therapeutics, Case Western Reserve University, Cleveland, USA 5Department of Biomedical Engineering, Case Western Reserve University, Cleveland, USA. Correspondence to: Jing Li <EMAIL>. |
| Pseudocode | Yes | The pseudo-code of our proposed algorithm is in Appendix C. |
| Open Source Code | Yes | Our code is available at https://github.com/fangzy96/TabCutMix. |
| Open Datasets | Yes | We use four real-world tabular datasets containing both numerical and categorical features: Adult, Default, Shoppers, and Magic. The detailed descriptions and overall statistics of these datasets are provided in Appendix D.1. https://archive.ics.uci.edu/dataset/2/adult https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope |
| Dataset Splits | Yes | For the Adult dataset, which has an official test set, we directly use it for testing, while the training set is split into training and validation sets in a ratio of 8:1. For the remaining datasets, the data is split into training, validation, and test sets with a ratio of 8:1:1, ensuring consistent splitting with a fixed random seed. |
| Hardware Specification | No | This work also made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. |
| Software Dependencies | No | We implement TabCutMix and all the baseline methods with PyTorch. All the methods are optimized with the Adam optimizer. |
| Experiment Setup | Yes | The following lists the hyperparameter search space for the XGBoost classifier applied during the MLE tasks, where grid search is used to determine the best parameter configurations: number of estimators: {10, 50, 100}; minimum child weight: {5, 10, 20}; maximum tree depth: {1, 10}; gamma: {0.0, 1.0}. |
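The paper describes TabCutMix as exchanging randomly selected feature segments between random same-class training sample pairs. A minimal NumPy sketch of that idea is below; the function name `tabcutmix`, the per-pair column-count draw, and allowing a sample to be paired with itself are assumptions for illustration, not the authors' exact implementation (their pseudo-code is in Appendix C of the paper).

```python
import numpy as np

def tabcutmix(X, y, rng=None):
    """Hypothetical sketch of TabCutMix-style augmentation: for each row,
    pick a random partner with the same class label and copy a random
    subset of that partner's feature columns into the row."""
    rng = np.random.default_rng(rng)
    X_aug = X.copy()
    n, d = X.shape
    for i in range(n):
        # candidate partners sharing the same class label (may include i itself)
        same = np.flatnonzero(y == y[i])
        j = rng.choice(same)
        # draw how many feature columns to exchange, then which ones
        k = rng.integers(1, d)
        cols = rng.choice(d, size=k, replace=False)
        X_aug[i, cols] = X[j, cols]
    return X_aug, y

# Tiny usage example on a 4-sample, 3-feature toy matrix
X = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([0, 0, 1, 1])
X_new, y_new = tabcutmix(X, y, rng=0)
```

Because columns are only copied between rows of the same class, every augmented value still comes from a real same-class training sample, which is what keeps the augmentation label-preserving.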
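The reported 8:1:1 train/validation/test split with a fixed random seed can be sketched as a seeded index permutation; the seed value 42 and the helper name `split_811` are assumptions (the paper only states that the seed is fixed).

```python
import numpy as np

def split_811(n, seed=42):
    """Sketch of an 8:1:1 train/val/test split over n samples,
    reproducible via a fixed seed (seed value is an assumption)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

tr, va, te = split_811(1000)
```

Re-running `split_811` with the same seed yields identical index sets, which is what makes the splits consistent across methods.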
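The grid search over the stated XGBoost hyperparameter space is a plain Cartesian product; the sketch below enumerates the 36 configurations using only the standard library. In practice each configuration would be passed to something like `xgboost.XGBClassifier(**config)` and scored on the validation split (that fitting step is omitted here to keep the example dependency-free).

```python
from itertools import product

# Hyperparameter search space for the XGBoost classifier in the MLE tasks,
# as listed in the paper's experiment setup.
grid = {
    "n_estimators": [10, 50, 100],
    "min_child_weight": [5, 10, 20],
    "max_depth": [1, 10],
    "gamma": [0.0, 1.0],
}

# Exhaustive grid: every combination of one value per hyperparameter.
keys = list(grid)
configs = [dict(zip(keys, vals)) for vals in product(*grid.values())]
print(len(configs))  # 3 * 3 * 2 * 2 = 36 configurations
```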