Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval

Authors: Pengcheng Jiang, Cao (Danica) Xiao, Minhao Jiang, Parminder Bhatia, Taha Kass-Hout, Jimeng Sun, Jiawei Han

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions." |
| Researcher Affiliation | Collaboration | UIUC; GE HealthCare |
| Pseudocode | Yes | "Algorithm 1: Dynamic Graph Retrieval and Augmentation" |
| Open Source Code | Yes | "Our code is available at: https://github.com/pat-jj/KARE" |
| Open Datasets | Yes | "We utilize the publicly available MIMIC-III (Johnson et al., 2016) (v1.4) and MIMIC-IV (Johnson et al., 2020) (v2.0) EHR datasets" |
| Dataset Splits | Yes | "Both datasets are split into training, validation, and test sets in a 0.8/0.1/0.1 ratio by patient, ensuring that all samples from the same patient are confined to a single subset, preventing data leakage." |
| Hardware Specification | Yes | "The experiments were conducted on a system with an AMD EPYC 7513 32-Core Processor and 1.0 TB of RAM. The setup includes eight NVIDIA A100 80GB PCIe GPUs, each with 81920 MiB of memory, providing a total of 640 GB GPU memory. The system's root partition has 32 GB of storage." |
| Software Dependencies | Yes | "Our fine-tuning framework is implemented using the TRL (von Werra et al., 2020), Transformers (Wolf et al., 2020), and Flash Attention-2 (Dao, 2024) Python libraries. We use Mistral-7B-Instruct-v0.3 (Jiang et al., 2023) as our local LLM... For dense retrieval from PubMed abstracts, we utilize the local embedding model Nomic (dimension = 768) (Nussbaum et al., 2024). We use Amazon Bedrock to access the Claude model. The optimal cosine distance thresholds θe and θr are both found to be 0.14, resulting in 513,867 triples in total after clustering. We employ Graspy (Chung et al., 2019) to implement the hierarchical Leiden algorithm, setting the maximum size for each top-level community (max cluster size) to 5. Using Claude 3.5 Sonnet as the LLM, we generate 147,264 community summaries (including both general and theme-specific summaries) with the prompts shown in Figures 12 and 13." |
| Experiment Setup | Yes | model_name_or_path: mistralai/Mistral-7B-Instruct-v0.3; torch_dtype: bfloat16; use_flash_attention_2: true; preprocessing_num_workers: 12; bf16: true; gradient_accumulation_steps: 4; gradient_checkpointing: true; learning_rate: 5.0e-06; max_seq_length: 6000; num_train_epochs: 3; per_device_train_batch_size: 1; lr_scheduler_type: cosine; warmup_ratio: 0.1 |
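The Experiment Setup parameters read like a TRL/Transformers training recipe. As a config fragment (a hedged reconstruction of how such a recipe is commonly written for TRL-based fine-tuning, not the authors' actual file), they would map to something like:

```yaml
# Hypothetical recipe reconstructed from the reported hyperparameters
model_name_or_path: mistralai/Mistral-7B-Instruct-v0.3
torch_dtype: bfloat16
use_flash_attention_2: true
preprocessing_num_workers: 12
bf16: true
gradient_accumulation_steps: 4
gradient_checkpointing: true
learning_rate: 5.0e-06
max_seq_length: 6000
num_train_epochs: 3
per_device_train_batch_size: 1
lr_scheduler_type: cosine
warmup_ratio: 0.1
```

Note the effective batch size: with a per-device batch of 1, gradient accumulation of 4, and eight A100s, each optimizer step sees 32 sequences of up to 6,000 tokens.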
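The patient-level 0.8/0.1/0.1 split quoted above can be sketched as follows. This is a minimal illustration, not KARE's actual code; the `patient_id` record layout and the helper name are assumptions.

```python
import random

def split_by_patient(samples, seed=42):
    """Split visit-level samples 0.8/0.1/0.1 by patient so that all
    samples from one patient land in exactly one subset (no leakage).

    `samples` is a list of dicts, each carrying a "patient_id" key
    (a hypothetical schema for illustration only).
    """
    patient_ids = sorted({s["patient_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)

    n = len(patient_ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train_ids = set(patient_ids[:n_train])
    val_ids = set(patient_ids[n_train:n_train + n_val])
    # The remaining ~10% of patients form the test set.

    train = [s for s in samples if s["patient_id"] in train_ids]
    val = [s for s in samples if s["patient_id"] in val_ids]
    test = [s for s in samples
            if s["patient_id"] not in train_ids
            and s["patient_id"] not in val_ids]
    return train, val, test
```

Splitting by patient rather than by sample is the key design choice: a patient with multiple admissions must never appear in both train and test, or readmission labels would leak.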
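The cosine-distance thresholds θe and θr (both 0.14) quoted under Software Dependencies decide when two entity or relation embeddings are treated as duplicates during triple clustering. A greedy single-pass sketch of such thresholded merging, assuming plain Python vectors rather than the paper's Nomic embeddings and not claiming to reproduce its exact clustering algorithm:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def merge_by_threshold(items, theta=0.14):
    """Each item joins the first existing cluster whose representative
    embedding lies within cosine distance `theta`; otherwise it starts
    a new cluster. `items` is a list of (name, embedding) pairs.
    """
    clusters = []  # list of (representative_embedding, member_names)
    for name, emb in items:
        for rep, members in clusters:
            if cosine_distance(emb, rep) <= theta:
                members.append(name)
                break
        else:
            clusters.append((emb, [name]))
    return [members for _, members in clusters]
```

Under a threshold like 0.14, near-duplicate surface forms whose embeddings are nearly parallel collapse into one cluster; this kind of deduplication is how the pipeline ends up with a reduced set of 513,867 triples.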