EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models

Authors: Hongyi Yuan, Songchi Zhou, Sheng Yu

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes new state-of-the-art quality for synthetic EHR data while protecting private information.
Researcher Affiliation | Academia | Hongyi Yuan, Songchi Zhou, and Sheng Yu are all affiliated with the Center for Statistical Science, Tsinghua University.
Pseudocode | Yes | Algorithm 1 (Heun's 2nd-order method for sampling); inputs: time step t_i and noise level σ_{t_i}. (A runnable sketch of such a sampler is given below the table.)
Open Source Code | Yes | Code is released at https://github.com/sczzz3/EHRDiff.git.
Open Datasets | Yes | In this work, we use a publicly available EHR database, MIMIC-III, to evaluate EHRDiff; MIMIC-III (Johnson et al., 2016) integrates de-identified and comprehensive clinical EHR data. CinC2012 Data (Silva et al., 2012) is a dataset proposed in the CinC 2012 challenge for predicting the mortality of ICU patients. PTB-ECG Data (Bousseljot et al., 1995) is a collection of ECG signals for heart disease diagnosis.
Dataset Splits | Yes | The final extracted number of EHRs is 46,520, of which 41,868 are randomly selected for model training while the rest are held out for evaluation. Sets A and B of the CinC2012 Data are used as the training and held-out testing sets, respectively. The PTB-ECG Data is split 8:2 into training and held-out testing sets. (A sketch of the MIMIC-III split is given below the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models or memory amounts) used for its experiments; it only mentions general settings such as "training on synthetic data" without further hardware specifications.
Software Dependencies | No | The paper mentions using LightGBM (Ke et al., 2017) as classifiers and an MLP with ReLU (Nair & Hinton, 2010) activations, but does not provide version numbers for these or for other key software components (e.g., Python, PyTorch/TensorFlow).
Experiment Setup | Yes | In our experiments, for the diffusion noise schedule, we set σ_min and σ_max to 0.02 and 80. ρ is set to 7 and the time axis is discretized into N = 32 steps. P_mean and P_std are both set to 1.2 for the noise distribution in the training process. F_θ in Equation 8 is parameterized by an MLP with ReLU (Nair & Hinton, 2010) activations, with hidden sizes [1024, 384, 384, 384, 1024]. For the baseline methods, we follow the settings reported in their papers. The reported standard errors (marked in the results tables) are calculated over 5 different runs. (A sketch of the noise schedule and MLP configuration is given below the table.)
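
The Algorithm 1 row above refers to Heun's 2nd-order sampling method. Below is a minimal sketch of one such sampling step in the style of EDM diffusion samplers (Karras et al.), not the authors' exact implementation; the `denoiser` callable and variable names are assumptions.

```python
import torch

def heun_sampling_step(denoiser, x, sigma_cur, sigma_next):
    """One Heun's 2nd-order step from noise level sigma_cur down to sigma_next.

    `denoiser` is assumed to map (x, sigma) to a denoised estimate, so the
    probability-flow ODE derivative is d = (x - denoiser(x, sigma)) / sigma.
    """
    # Euler (predictor) step.
    d_cur = (x - denoiser(x, sigma_cur)) / sigma_cur
    x_next = x + (sigma_next - sigma_cur) * d_cur
    # Heun (corrector) step; skipped at the final step where sigma_next == 0.
    if sigma_next > 0:
        d_next = (x_next - denoiser(x_next, sigma_next)) / sigma_next
        x_next = x + (sigma_next - sigma_cur) * 0.5 * (d_cur + d_next)
    return x_next
```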
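
The MIMIC-III split reported above (41,868 of 46,520 records for training, the rest held out) can be reproduced with a simple random permutation; this is a hypothetical sketch, and the random seed is an assumption, not taken from the paper.

```python
import numpy as np

# Hypothetical reproduction of the reported MIMIC-III split: 46,520 EHRs in
# total, 41,868 randomly selected for training, the remainder held out.
rng = np.random.default_rng(seed=0)  # the seed is an assumption
n_total, n_train = 46_520, 41_868
perm = rng.permutation(n_total)
train_idx, test_idx = perm[:n_train], perm[n_train:]
assert len(test_idx) == n_total - n_train  # 4,652 held-out records
```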
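
The experiment-setup hyperparameters above match the discretized noise schedule of Karras et al. (EDM). The sketch below instantiates that schedule with the reported values and builds an MLP with the reported hidden sizes; `feature_dim` and the omission of noise-level conditioning are assumptions made for brevity.

```python
import numpy as np
import torch
import torch.nn as nn

# EDM-style discretized noise schedule with the reported hyperparameters:
# sigma_min = 0.02, sigma_max = 80, rho = 7, N = 32 time steps.
sigma_min, sigma_max, rho, N = 0.02, 80.0, 7.0, 32
steps = np.arange(N)
sigmas = (sigma_max ** (1 / rho)
          + steps / (N - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

# Training-time noise levels: ln(sigma) ~ Normal(P_mean, P_std^2), with the
# reported P_mean = 1.2 and P_std = 1.2.
def sample_training_sigma(batch_size, p_mean=1.2, p_std=1.2):
    return torch.exp(torch.randn(batch_size) * p_std + p_mean)

# MLP for F_theta with ReLU activations and the reported hidden sizes
# [1024, 384, 384, 384, 1024]. `feature_dim` (the EHR feature dimensionality)
# is an assumption; noise-level conditioning is omitted for brevity, though
# the actual F_theta also receives the noise level as input.
def build_mlp(feature_dim):
    sizes = [feature_dim, 1024, 384, 384, 384, 1024, feature_dim]
    layers = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU on the output
```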