DiffPuter: Empowering Diffusion Models for Missing Data Imputation

Authors: Hengrui Zhang, Liancheng Fang, Qitian Wu, Philip Yu

ICLR 2025

Reproducibility checklist: each entry below lists the variable, the result, and the supporting LLM response.
Research Type: Experimental. Extensive experiments across ten diverse datasets and comparisons with 17 different imputation methods demonstrate DIFFPUTER's superior performance. Notably, DIFFPUTER achieves an average improvement of 6.94% in MAE and 4.78% in RMSE compared to the most competitive existing method. We conduct experiments on benchmark tabular datasets containing both continuous and discrete features under various missing data scenarios. Experimental results demonstrate the superior performance of DIFFPUTER across all settings and on almost all datasets.
Researcher Affiliation: Academia. (1) Computer Science Department, University of Illinois at Chicago; (2) Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard.
Pseudocode: Yes. Algorithm 1 (M-step: Density Estimation using Diffusion Model) and Algorithm 2 (E-step: Missing Data Imputation).
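The two algorithms alternate in an EM loop: the M-step fits a diffusion model to the current completed data, and the E-step re-imputes the missing entries from that model. A minimal sketch of this loop follows; the functions `train_diffusion_model` and `impute_with_diffusion` are hypothetical placeholders (here simple column-mean stand-ins), not the authors' actual API.

```python
import numpy as np

def train_diffusion_model(x):
    # Placeholder M-step: column means stand in for fitting the
    # denoising/diffusion network on the current completed data.
    return x.mean(axis=0)

def impute_with_diffusion(model, x, mask):
    # Placeholder E-step: broadcast the fitted statistics as imputations;
    # the real method samples conditional completions from the diffusion model.
    return np.broadcast_to(model, x.shape)

def em_imputation(x, mask, n_rounds=4):
    """x: data with NaNs at missing entries; mask: True where observed."""
    x_imp = np.where(mask, x, 0.0)                          # crude initialization
    for _ in range(n_rounds):
        model = train_diffusion_model(x_imp)                # M-step (Algorithm 1)
        x_samp = impute_with_diffusion(model, x_imp, mask)  # E-step (Algorithm 2)
        x_imp = np.where(mask, x, x_samp)                   # observed entries stay fixed
    return x_imp
```

Note that observed entries are never overwritten; only the missing positions are refined across rounds.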
Open Source Code: Yes. The code is available at https://github.com/hengruizhang98/DiffPuter.
Open Datasets: Yes. We evaluate the proposed DIFFPUTER on public real-world datasets of varying scales. We consider five datasets of only continuous features: California, Letter, Gesture, Magic, and Bean, and four datasets of both continuous and discrete features: Adult, Default, Shoppers, and News. The detailed information of these datasets is presented in Appendix D.2. ... all of them are available at Kaggle or the UCI Machine Learning repository.
Dataset Splits: Yes. For each dataset, we use 70% as the training set and the remaining 30% as the testing set. All methods are trained on the training set. Imputation is applied to the missing values in both the training set and the testing set. Consequently, imputing the training set is the in-sample setting, while imputing the testing set is the out-of-sample setting.
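The split protocol above can be sketched as follows. The 70/30 split comes from the text; the 30% MCAR missing rate and the toy data shape are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))             # toy dataset (assumed shape)

# 70/30 train/test split, as described in the evaluation protocol.
perm = rng.permutation(len(x))
n_train = int(0.7 * len(x))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Inject missingness into BOTH parts (rate is an illustrative assumption).
mask = rng.random(x.shape) > 0.3           # True = observed
x_obs = np.where(mask, x, np.nan)

# Imputing x_train is the in-sample setting; x_test is out-of-sample.
x_train, x_test = x_obs[train_idx], x_obs[test_idx]
```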
Hardware Specification: Yes. Operating System: Ubuntu 22.04.3 LTS; CPU: 13th Gen Intel(R) Core(TM) i9-13900K; GPU: NVIDIA GeForce RTX 4090 with 24 GB of memory.
Software Dependencies: Yes. CUDA 12.2, Python 3.9.16, PyTorch 1.12.1 (Paszke et al., 2019).
Experiment Setup: Yes. For the diffusion model, we set the maximum time T = 80 and the noise level σ(t) = t, which is linear in t. The score/denoising neural network ϵ(x_t, t) is implemented as a 5-layer MLP with hidden dimension 1024. ... When using the learned diffusion model for imputation, we set the number of discrete steps M = 50 and the number of sampling times per data sample N = 10. DIFFPUTER is implemented with PyTorch and optimized using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1 × 10^-4.
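The stated hyperparameters can be sketched in PyTorch. Only the values (T = 80, σ(t) = t, 5-layer MLP with hidden width 1024, M = 50 steps, Adam with lr 1e-4) come from the text; the time-conditioning scheme, activation choice, and step spacing are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """5-layer MLP approximating the noise eps(x_t, t); time conditioning
    via concatenation is an assumption of this sketch."""
    def __init__(self, d_in, hidden=1024, n_layers=5):
        super().__init__()
        layers, d = [], d_in + 1            # +1 input feature for the time t
        for _ in range(n_layers - 1):
            layers += [nn.Linear(d, hidden), nn.SiLU()]
            d = hidden
        layers.append(nn.Linear(d, d_in))   # output: predicted noise, same dim as x
        self.net = nn.Sequential(*layers)

    def forward(self, x_t, t):
        t = t.expand(x_t.shape[0], 1)       # broadcast the time to the batch
        return self.net(torch.cat([x_t, t], dim=1))

T_MAX, M_STEPS = 80.0, 50
# sigma(t) = t, so discretized noise levels are just the time grid;
# linear spacing is an assumption of this sketch.
ts = torch.linspace(T_MAX, 1e-3, M_STEPS)

model = Denoiser(d_in=8)                    # d_in = number of features (toy value)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```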