EHRDiff : Exploring Realistic EHR Synthesis with Diffusion Models
Authors: Hongyi Yuan, Songchi Zhou, Sheng Yu
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes a new state of the art in synthetic EHR data quality while protecting private information. |
| Researcher Affiliation | Academia | Hongyi Yuan EMAIL Center for Statistical Science Tsinghua University Songchi Zhou EMAIL Center for Statistical Science Tsinghua University Sheng Yu EMAIL Center for Statistical Science Tsinghua University |
| Pseudocode | Yes | Algorithm 1 Heun's 2nd-Order Method for Sampling. Input: time step t_i and noise level σ_{t_i} |
| Open Source Code | Yes | Codes are released in https://github.com/sczzz3/EHRDiff.git. |
| Open Datasets | Yes | In this work, we use a publicly available EHR database, MIMIC-III, to evaluate EHRDiff. Deidentified and comprehensive clinical EHR data is integrated into MIMIC-III (Johnson et al., 2016). CinC2012 Data (Silva et al., 2012) is a dataset proposed for predicting the mortality of ICU patients in the CinC 2012 challenge. PTB-ECG Data (Bousseljot et al., 1995) is a collection of ECG signals for heart disease diagnosis. |
| Dataset Splits | Yes | The final extracted number of EHRs is 46,520 and we randomly select 41,868 for model training while the rest are held out for evaluation. We use sets A and B in CinC2012 Data as training and held-out testing sets respectively. The PTB-ECG Data is split with a ratio of 8:2 for training and held-out testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It only mentions general settings like 'training on synthetic data' without further hardware specifications. |
| Software Dependencies | No | The paper mentions using 'LightGBM (Ke et al., 2017) as classifiers' and 'MLP with ReLU (Nair & Hinton, 2010) activations' but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | In our experiments, for the diffusion noise schedule, we set σ_min and σ_max to 0.02 and 80, respectively. ρ is set to 7 and the time step is discretized to N = 32. P_mean is set to 1.2 and P_std is set to 1.2 for the noise distribution in the training process. For F_θ in Equation 8, it is parameterized by an MLP with ReLU (Nair & Hinton, 2010) activations and the hidden states are set to [1024, 384, 384, 384, 1024]. For the baseline methods, we follow the settings reported in their papers. The reported standard errors are calculated over 5 different runs. |
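The noise-schedule hyperparameters quoted above (σ_min = 0.02, σ_max = 80, ρ = 7, N = 32) correspond to the discretization of Karras et al. (2022), over which a Heun 2nd-order sampler (the paper's Algorithm 1) integrates the diffusion ODE. The sketch below illustrates that schedule and sampler under those assumptions; the function names `karras_sigmas`, `heun_sample`, and the `denoise` callback are illustrative, not the authors' released code.

```python
def karras_sigmas(sigma_min=0.02, sigma_max=80.0, rho=7.0, n_steps=32):
    """Discretized noise levels sigma_0 > ... > sigma_{N-1} (Karras et al., 2022)."""
    s_max, s_min = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(s_max + i / (n_steps - 1) * (s_min - s_max)) ** rho
            for i in range(n_steps)]

def heun_sample(x, denoise, sigmas):
    """Heun's 2nd-order sampler: integrate dx/dsigma = (x - D(x, sigma)) / sigma."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, sigma)) / sigma       # Euler slope at current noise level
        x_euler = x + (sigma_next - sigma) * d    # Euler predictor step
        if sigma_next > 0:                        # Heun corrector: average the slopes
            d_next = (x_euler - denoise(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_next)
        else:
            x = x_euler
    return x

# Toy denoiser that always predicts the zero vector: each step then contracts
# x by sigma_next / sigma, so the trajectory shrinks by sigma_min / sigma_max.
sigmas = karras_sigmas()
x_final = heun_sample(80.0, lambda x, s: 0.0, sigmas)
```

With a trained denoiser in place of the toy one, `heun_sample` would map pure noise at σ_max into a synthetic EHR feature vector at σ_min.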