Multimodal Medical Code Tokenizer
Authors: Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We integrate MEDTOK into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MEDTOK improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.32% on EHRShot, with the largest gains in drug recommendation. |
| Researcher Affiliation | Collaboration | 1Department of Biomedical Informatics, Harvard University, Boston, MA, USA 2Digital Data, Sanofi, Cambridge, MA, USA. Correspondence to: Marinka Zitnik <EMAIL>. |
| Pseudocode | No | The paper describes the approach in prose and with mathematical equations, for example, in Section 3.1 'Multimodal tokenization' and Section 3.2 'Token packing', but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | We introduce MEDTOK (https://github.com/mims-harvard/MedTok), a multimodal medical code tokenizer that integrates textual descriptions and graph-based dependencies from biomedical ontologies (Figure 1). |
| Open Datasets | Yes | We collected a total of 617,490 medical codes from eight commonly used coding systems: ICD-9 (Organization et al., 1988), ICD-10-CM (Fung et al., 2020), ICD-10-PCS (Averill et al., 2001), SNOMED CT (Donnelly et al., 2006), ATC (Miller & Britt, 1995), NDC (Palmer, 2006), CPT (Dotson, 2013), and RxNorm (Nelson et al., 2011)... Each code is paired with a textual description from official documents and a subgraph from PrimeKG (Chandak et al., 2023). We used three publicly available EHR datasets: MIMIC-III (Johnson et al., 2016), MIMIC-IV (Johnson et al., 2024), and EHRShot (Wornow et al., 2023). |
| Dataset Splits | Yes | Training and dataset splitting on MIMIC-IV adhered to the methodology outlined in the ETHOS paper. ... For the EHRShot dataset, inference was conducted on the full dataset for mortality and disease-related tasks, and on randomly selected, stratified samples of ten thousand instances for other tasks. |
| Hardware Specification | Yes | MEDTOK is trained on a machine equipped with 4 NVIDIA H100 GPUs. All experiments were conducted with 1 NVIDIA H100. |
| Software Dependencies | Yes | We implement MEDTOK using Python 3.9.19, PyTorch 2.3.1, Transformers 4.43.1, and Tokenizers 0.19.1. |
| Experiment Setup | Yes | During the training stage, we set the number of training steps to 3000 with a global batch size of 1024; the dimension of the quantized vectors is 64. Regarding model weights, we freeze the text encoder in MEDTOK while the graph encoder remains trainable during training. It should be noted that we adopt a unified epoch number of 50 for all baselines. ...we recommend setting λ = β = 0.1 for in-patient settings and λ = β = 0.01 for out-patient settings. |
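As a sketch, the quoted experiment setup can be gathered into a single configuration helper. The function name and dictionary keys below are hypothetical, not from the authors' released code; the values are taken from the Experiment Setup row above.

```python
def medtok_config(setting="in-patient"):
    """Return the MEDTOK training hyperparameters reported in the paper.

    `setting` selects the recommended loss-balancing weights:
    lambda = beta = 0.1 for in-patient data, 0.01 for out-patient data.
    (Helper name and key names are illustrative assumptions.)
    """
    loss_weight = 0.1 if setting == "in-patient" else 0.01
    return {
        "training_steps": 3000,        # total optimization steps
        "global_batch_size": 1024,     # global batch across the 4 H100 GPUs
        "quantized_dim": 64,           # dimension of quantized code vectors
        "freeze_text_encoder": True,   # text encoder frozen during training
        "train_graph_encoder": True,   # graph encoder remains trainable
        "baseline_epochs": 50,         # unified epoch count for all baselines
        "lambda_weight": loss_weight,  # loss-balancing weights λ and β
        "beta_weight": loss_weight,
    }
```

Swapping `setting` to `"out-patient"` is the only change needed to follow the out-patient recommendation; all other hyperparameters are shared between the two regimes.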