Do Multiple Instance Learning Models Transfer?

Authors: Daniel Shao, Richard J. Chen, Andrew H. Song, Joel Runevic, Ming Y. Lu, Tong Ding, Faisal Mahmood

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch.
Researcher Affiliation Academia 1Massachusetts Institute of Technology, Cambridge MA, USA 2Harvard University, Boston MA, USA. Correspondence to: Daniel Shao <EMAIL>, Faisal Mahmood <Faisal EMAIL>.
Pseudocode No The paper describes methods in prose and through result tables and figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Lastly, we provide a resource which standardizes the implementation of MIL models and provide model weights for FEATHER, a PC-108 pretrained ABMIL model at https://github.com/mahmoodlab/MIL-Lab.
Open Datasets Yes To this end, we exhaustively evaluate transfer performance of 11 MIL architectures across 19 publicly available benchmarks. We outline the evaluation protocol, pretraining and target datasets, and MIL architectures used for assessing MIL transfer below. In addition to the 19 tasks, we also include two pan-cancer tasks called PC-43 and PC-108 (Chen et al., 2024a). These represent 43-class and 108-class cancer subtyping tasks encompassing diverse malignancies from 17 organ types and are curated from the same hierarchical classification dataset for pretraining purposes only. EBRAINS (Roetzer-Pejrimovsky et al., 2022) NSCLC: The non-small cell lung carcinoma (NSCLC) subtyping task was a binary classification problem for distinguishing lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). The training data consisted of publicly available H&E WSIs from TCGA. PANDA (Bulten et al., 2022) BRACS (Brancati et al., 2021)
Dataset Splits Yes For evaluation, we assess MIL transfer performance on 19 publicly available CPath tasks, with training datasets ranging in size from 314 to 8,492 WSIs and in label complexity from 2 to 30 classes. EBRAINS (Roetzer-Pejrimovsky et al., 2022): We use label-stratified train/val/test splits (50% / 25% / 25%) provided by UNI (Chen et al., 2024a), with the same folds for both coarse- and fine-grained tasks. NSCLC: We performed an 80% / 10% / 10% train/val/test split on the TCGA dataset for training and internal validation. PANDA (Bulten et al., 2022): We use the same train/val/test folds (80% / 10% / 10%) as UNI. BRACS (Brancati et al., 2021): We use the official train/val/test folds (72% / 12% / 16%). Lung cancer biomarkers: each task is site- and label-stratified into approximate train/val/test splits (60% / 20% / 20%). Breast cancer biomarkers: each task is site- and label-stratified into approximate train/val/test splits (60% / 20% / 20%). Additionally, we evaluate on breast cancer core needle biopsies (BCNB, n = 1,058) (Xu et al., 2021) in a label-stratified train/test split (90% / 10%). GBMLGG mutational subtyping (Brennan et al., 2013; Roetzer-Pejrimovsky et al., 2022): We use the UNI splits, which label-stratify TCGA-GBMLGG into a train/val/test split with a 47:22:31 ratio.
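The label-stratified splitting described above can be sketched in pure Python. This is an illustrative reconstruction, not the authors' code: the 50/25/25 fractions follow the EBRAINS protocol, and the per-class rounding is an assumption.

```python
import random
from collections import defaultdict

def stratified_split(slide_ids, labels, fracs=(0.50, 0.25, 0.25), seed=0):
    """Label-stratified train/val/test split (illustrative sketch).

    Groups slides by class label, shuffles within each class, and
    allocates each class to the three splits by the given fractions,
    so class proportions are preserved across splits.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sid, y in zip(slide_ids, labels):
        by_class[y].append(sid)

    train, val, test = [], [], []
    for y, ids in sorted(by_class.items()):
        rng.shuffle(ids)
        n = len(ids)
        n_train = int(fracs[0] * n)
        n_val = int(fracs[1] * n)
        train.extend(ids[:n_train])
        val.extend(ids[n_train:n_train + n_val])
        test.extend(ids[n_train + n_val:])  # remainder absorbs rounding
    return train, val, test

# Toy usage: 100 hypothetical slides, two balanced classes.
tr, va, te = stratified_split([f"s{i}" for i in range(100)],
                              [i % 2 for i in range(100)])
```

Because the remainder goes to the test split, the split sizes match the target fractions only up to per-class rounding.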
Hardware Specification Yes Experiments were performed across four NVIDIA RTX A4000s, three NVIDIA GeForce RTX 2080 Ti GPUs, and three RTX 3090s, with a single GPU used per experiment.
Software Dependencies No The paper mentions "PyTorch's native implementation" but does not specify its version or any other key software libraries with their version numbers.
Experiment Setup Yes Unless specified otherwise, all MIL models are implemented using the authors' original model definitions, trained with UNI features, and with standardized hyperparameters: AdamW optimizer with a learning rate of 1e-4 and a cosine decay scheduler. For datasets with a validation set, we train for a maximum of 20 epochs (minimum of 10) with an early stopping patience of 5 epochs on the validation set. For datasets without a validation set, we train for 10 epochs. We use cross-entropy loss with random class-weighted sampling and a batch size of 1. For regularization, we use a weight decay of 1e-5, a dropout of 0.25 at every feedforward layer, and a dropout of 0.1 on the features from the pretrained encoder.
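The standardized hyperparameters above can be sketched as a minimal PyTorch configuration. This is an assumption-laden illustration, not the released code: the model here is a placeholder two-layer head (the 1024-d input matches UNI's ViT-L embedding size), while the optimizer, scheduler, loss, dropout rates, and batch size of 1 follow the quoted setup.

```python
import torch
import torch.nn as nn

# Placeholder MIL head standing in for an arbitrary MIL architecture.
model = nn.Sequential(
    nn.Dropout(0.1),        # dropout of 0.1 on pretrained encoder features
    nn.Linear(1024, 512),   # 1024-d input assumes UNI (ViT-L) embeddings
    nn.ReLU(),
    nn.Dropout(0.25),       # dropout of 0.25 at feedforward layers
    nn.Linear(512, 2),      # e.g. binary subtyping such as LUAD vs. LUSC
)

# AdamW with lr 1e-4 and weight decay 1e-5, cosine decay over 20 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)
criterion = nn.CrossEntropyLoss()

# One optimization step on a dummy bag (batch size 1, as in the paper).
bag = torch.randn(1, 1024)
logits = model(bag)
loss = criterion(logits, torch.tensor([0]))
loss.backward()
optimizer.step()
scheduler.step()  # advance cosine schedule by one epoch
```

Early stopping (patience 5, minimum 10 epochs) and class-weighted sampling would wrap this step inside an epoch loop, e.g. via `torch.utils.data.WeightedRandomSampler`.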