Multimodal Medical Code Tokenizer
Authors: Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We integrate MEDTOK into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MEDTOK improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.32% on EHRShot, with the largest gains in drug recommendation. |
| Researcher Affiliation | Collaboration | 1Department of Biomedical Informatics, Harvard University, Boston, MA, USA 2Digital Data, Sanofi, Cambridge, MA, USA. Correspondence to: Marinka Zitnik <EMAIL>. |
| Pseudocode | No | The paper describes the approach in prose and with mathematical equations, for example, in Section 3.1 'Multimodal tokenization' and Section 3.2 'Token packing', but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | We introduce MEDTOK (https://github.com/mims-harvard/MedTok), a multimodal medical code tokenizer that integrates textual descriptions and graph-based dependencies from biomedical ontologies (Figure 1). |
| Open Datasets | Yes | We collected a total of 617,490 medical codes from eight commonly used coding systems: ICD-9 (Organization et al., 1988), ICD-10-CM (Fung et al., 2020), ICD-10-PCS (Averill et al., 2001), SNOMED CT (Donnelly et al., 2006), ATC (Miller & Britt, 1995), NDC (Palmer, 2006), CPT (Dotson, 2013), and RxNorm (Nelson et al., 2011)... Each code is paired with a textual description from official documents and a subgraph from PrimeKG (Chandak et al., 2023). We used three publicly available EHR datasets: MIMIC-III (Johnson et al., 2016), MIMIC-IV (Johnson et al., 2024), and EHRShot (Wornow et al., 2023). |
| Dataset Splits | Yes | Training and dataset splitting on MIMIC-IV adhered to the methodology outlined in the ETHOS paper. ... For the EHRShot dataset, inference was conducted on the full dataset for mortality and disease-related tasks, and on randomly selected, stratified samples of ten thousand instances for other tasks. |
| Hardware Specification | Yes | MEDTOK is trained on a machine equipped with 4 NVIDIA H100 GPUs. All experiments were conducted with 1 NVIDIA H100. |
| Software Dependencies | Yes | We implement MEDTOK using Python 3.9.19, PyTorch 2.3.1, Transformers 4.43.1, and Tokenizers 0.19.1. |
| Experiment Setup | Yes | During the training stage, we set the number of training steps to 3000 with a global batch size of 1024; the dimension of the quantized vectors is 64. Regarding model weights, we freeze the text encoder in MEDTOK while the graph encoder remains trainable during training. It should be noted that we adopt a unified epoch number of 50 for all baselines. ...we recommend setting λ = β = 0.1 for in-patient settings and λ = β = 0.01 for out-patient settings. |
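As a sketch, the quoted experiment setup can be gathered into a single configuration helper. The function name and dictionary keys below are hypothetical, not from the authors' released code; the values are taken from the Experiment Setup row above.

```python
def medtok_config(setting="in-patient"):
    """Return the MEDTOK training hyperparameters reported in the paper.

    `setting` selects the recommended loss-balancing weights:
    lambda = beta = 0.1 for in-patient data, 0.01 for out-patient data.
    (Helper name and key names are illustrative assumptions.)
    """
    loss_weight = 0.1 if setting == "in-patient" else 0.01
    return {
        "training_steps": 3000,        # total optimization steps
        "global_batch_size": 1024,     # global batch across the 4 H100 GPUs
        "quantized_dim": 64,           # dimension of quantized code vectors
        "freeze_text_encoder": True,   # text encoder frozen during training
        "train_graph_encoder": True,   # graph encoder remains trainable
        "baseline_epochs": 50,         # unified epoch count for all baselines
        "lambda_weight": loss_weight,  # loss-balancing weights λ and β
        "beta_weight": loss_weight,
    }
```

Swapping `setting` to `"out-patient"` is the only change needed to follow the out-patient recommendation; all other hyperparameters are shared between the two regimes.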