Guided Discrete Diffusion for Electronic Health Record Generation
Authors: Jun Han, Zixiang Chen, Yongqian Li, Yiwen Kou, Eran Halperin, Robert E. Tillman, Quanquan Gu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership inference risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data. |
| Researcher Affiliation | Collaboration | Jun Han* (Optum AI, UHG); Zixiang Chen*, Yongqian Li, Yiwen Kou (Department of Computer Science, UCLA) |
| Pseudocode | No | The paper describes the model architecture and procedures using mathematical equations and descriptive text, but does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | The paper mentions the open-source codebases for baseline models like EHRDiff and EHRMGAN, but it does not provide concrete access to the source code for the methodology described in this paper (EHR-D3PM). |
| Open Datasets | Yes | Public Datasets MIMIC-III (Johnson et al., 2016) includes deidentified patient EHRs from hospital stays. |
| Dataset Splits | Yes | MIMIC Dataset ... We have implemented an 80/20 split for training and testing purposes. Specifically, this allocates 12,862 records for testing and the remaining 51,451 for training. The first dataset, denoted by D1, includes a patient population of size 1,670,347. We split the whole dataset into 100K for validation, 200K for testing, and the remaining 1,370,347 for training. The second dataset, denoted by D2, includes a patient population of size 1,859,536. We split the whole dataset into 100K for validation, 200K for testing, and the remaining 1,559,536 for training. |
| Hardware Specification | Yes | It takes less than three hours to finish training this model on an A6000 with 48GB memory. ... It takes one and a half days to train one model on an A100 with 80GB memory. |
| Software Dependencies | No | The paper mentions using a light gradient boosting decision tree model (LGBM) and the AdamW optimizer, but it does not provide specific version numbers for these or other key software components, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | The hidden dimension is 256. The number of multi-head attention heads is 8. The number of transformer layers is 5. The number of diffusion steps is 500. In the optimization phase, we adopt the AdamW optimizer, and the weight decay in AdamW is 1e-5. The learning rate is 1e-4 and the batch size is 256. The beta for the exponential LR schedule is 0.99. The number of training epochs is 100. ... For the downstream tasks, we used a light gradient boosting decision tree model (LGBM) (Ke et al., 2017) as it had uniformly robust prediction performance on all downstream tasks. In all experiments, we set the hyper-parameters of LGBM as follows: n_estimators = 1000, learning_rate = 0.05, max_depth = 10, reg_alpha = 0.5, reg_lambda = 0.5, scale_pos_weight = 1, min_data_in_bin = 128. |
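As a sanity check on the dataset-split figures quoted above, the D1 and D2 training sizes are consistent with carving out 100K validation and 200K test records from each population, and the MIMIC test fraction matches the stated 80/20 split. A minimal sketch (the helper function name is illustrative, not from the paper):

```python
# Verify the split arithmetic quoted in the Dataset Splits row.
# Assumes 100K validation / 200K test records for D1 and D2.

def train_size(total: int, n_val: int, n_test: int) -> int:
    """Records remaining for training after validation and test are removed."""
    return total - n_val - n_test

# MIMIC-III: 12,862 test + 51,451 train records, an 80/20 split.
mimic_total = 51_451 + 12_862  # 64,313 records
assert round(12_862 / mimic_total, 2) == 0.20

# D1: 1,670,347 total -> 1,370,347 for training.
assert train_size(1_670_347, 100_000, 200_000) == 1_370_347

# D2: 1,859,536 total -> 1,559,536 for training.
assert train_size(1_859_536, 100_000, 200_000) == 1_559_536

print("split arithmetic checks out")
```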
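The hyper-parameters quoted in the Experiment Setup row can be collected into config objects; since the paper's code is not released, the dictionary key names below are illustrative assumptions, though the LGBM keyword names follow the actual LightGBM API:

```python
# EHR-D3PM training settings as reported in the paper.
# Key names are hypothetical; only the values come from the text.
ehr_d3pm_config = {
    "hidden_dim": 256,
    "num_attention_heads": 8,
    "num_transformer_layers": 5,
    "diffusion_steps": 500,
    "optimizer": "AdamW",
    "weight_decay": 1e-5,
    "learning_rate": 1e-4,
    "batch_size": 256,
    "lr_schedule": {"type": "exponential", "gamma": 0.99},
    "epochs": 100,
}

# Downstream-task classifier settings, as they would be passed to
# lightgbm.LGBMClassifier(**lgbm_params).
lgbm_params = dict(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=10,
    reg_alpha=0.5,
    reg_lambda=0.5,
    scale_pos_weight=1,
    min_data_in_bin=128,
)
```

Having both configs in one place makes it easy to spot the gap the review flags under Software Dependencies: the values are fully specified, but the library versions needed to reproduce them are not.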