DiffPuter: Empowering Diffusion Models for Missing Data Imputation

Authors: Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip Yu

ICLR 2025

Reproducibility checklist: each entry below lists the variable, the result, and the supporting LLM response.
Research Type: Experimental. Extensive experiments across ten diverse datasets and comparisons with 17 different imputation methods demonstrate DIFFPUTER's superior performance. Notably, DIFFPUTER achieves an average improvement of 6.94% in MAE and 4.78% in RMSE compared to the most competitive existing method. We conduct experiments on benchmark tabular datasets containing both continuous and discrete features under various missing data scenarios. Experimental results demonstrate the superior performance of DIFFPUTER across all settings and on almost all datasets.
Researcher Affiliation: Academia. (1) Computer Science Department, University of Illinois at Chicago; (2) Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard.
Pseudocode: Yes. Algorithm 1 (M-step: Density Estimation using Diffusion Model) and Algorithm 2 (E-step: Missing Data Imputation).
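The two algorithms alternate in an EM loop: the M-step fits a diffusion model to the current completed data, and the E-step re-imputes the missing entries from that model. A minimal sketch of this loop follows; the functions `train_diffusion_model` and `impute_with_diffusion` are hypothetical placeholders (here simple column-mean stand-ins), not the authors' actual API.

```python
import numpy as np

def train_diffusion_model(x):
    # Placeholder M-step: column means stand in for fitting the
    # denoising/diffusion network on the current completed data.
    return x.mean(axis=0)

def impute_with_diffusion(model, x, mask):
    # Placeholder E-step: broadcast the fitted statistics as imputations;
    # the real method samples conditional completions from the diffusion model.
    return np.broadcast_to(model, x.shape)

def em_imputation(x, mask, n_rounds=4):
    """x: data with NaNs at missing entries; mask: True where observed."""
    x_imp = np.where(mask, x, 0.0)                          # crude initialization
    for _ in range(n_rounds):
        model = train_diffusion_model(x_imp)                # M-step (Algorithm 1)
        x_samp = impute_with_diffusion(model, x_imp, mask)  # E-step (Algorithm 2)
        x_imp = np.where(mask, x, x_samp)                   # observed entries stay fixed
    return x_imp
```

Note that observed entries are never overwritten; only the missing positions are refined across rounds.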
Open Source Code: Yes. The code is available at https://github.com/hengruizhang98/DiffPuter.
Open Datasets: Yes. We evaluate the proposed DIFFPUTER on public real-world datasets of varying scales. We consider five datasets of only continuous features: California, Letter, Gesture, Magic, and Bean, and four datasets of both continuous and discrete features: Adult, Default, Shoppers, and News. The detailed information of these datasets is presented in Appendix D.2. ... all of them are available at Kaggle or the UCI Machine Learning repository.
Dataset Splits: Yes. For each dataset, we use 70% as the training set and the remaining 30% as the testing set. All methods are trained on the training set. Imputation is applied to the missing values in both the training set and the testing set. Consequently, imputing the training set is the in-sample setting, while imputing the testing set is the out-of-sample setting.
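The split protocol above can be sketched as follows. The 70/30 split comes from the text; the 30% MCAR missing rate and the toy data shape are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))             # toy dataset (assumed shape)

# 70/30 train/test split, as described in the evaluation protocol.
perm = rng.permutation(len(x))
n_train = int(0.7 * len(x))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Inject missingness into BOTH parts (rate is an illustrative assumption).
mask = rng.random(x.shape) > 0.3           # True = observed
x_obs = np.where(mask, x, np.nan)

# Imputing x_train is the in-sample setting; x_test is out-of-sample.
x_train, x_test = x_obs[train_idx], x_obs[test_idx]
```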
Hardware Specification: Yes. Operating System: Ubuntu 22.04.3 LTS; CPU: 13th Gen Intel(R) Core(TM) i9-13900K; GPU: NVIDIA GeForce RTX 4090 with 24 GB of memory.
Software Dependencies: Yes. CUDA 12.2, Python 3.9.16, PyTorch 1.12.1 (Paszke et al., 2019).
Experiment Setup: Yes. For the diffusion model, we set the maximum time T = 80 and the noise level σ(t) = t, which is linear in t. The score/denoising neural network ϵ(x_t, t) is implemented as a 5-layer MLP with hidden dimension 1024. ... When using the learned diffusion model for imputation, we set the number of discrete steps M = 50 and the number of sampling times per data sample N = 10. DIFFPUTER is implemented with PyTorch and optimized using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1 × 10^-4.
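The stated hyperparameters can be sketched in PyTorch. Only the values (T = 80, σ(t) = t, 5-layer MLP with hidden width 1024, M = 50 steps, Adam with lr 1e-4) come from the text; the time-conditioning scheme, activation choice, and step spacing are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """5-layer MLP approximating the noise eps(x_t, t); time conditioning
    via concatenation is an assumption of this sketch."""
    def __init__(self, d_in, hidden=1024, n_layers=5):
        super().__init__()
        layers, d = [], d_in + 1            # +1 input feature for the time t
        for _ in range(n_layers - 1):
            layers += [nn.Linear(d, hidden), nn.SiLU()]
            d = hidden
        layers.append(nn.Linear(d, d_in))   # output: predicted noise, same dim as x
        self.net = nn.Sequential(*layers)

    def forward(self, x_t, t):
        t = t.expand(x_t.shape[0], 1)       # broadcast the time to the batch
        return self.net(torch.cat([x_t, t], dim=1))

T_MAX, M_STEPS = 80.0, 50
# sigma(t) = t, so discretized noise levels are just the time grid;
# linear spacing is an assumption of this sketch.
ts = torch.linspace(T_MAX, 1e-3, M_STEPS)

model = Denoiser(d_in=8)                    # d_in = number of features (toy value)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```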