Do Multiple Instance Learning Models Transfer?

Authors: Daniel Shao, Richard J. Chen, Andrew H. Song, Joel Runevic, Ming Y. Lu, Tong Ding, Faisal Mahmood

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch.
Researcher Affiliation Academia 1Massachusetts Institute of Technology, Cambridge MA, USA 2Harvard University, Boston MA, USA. Correspondence to: Daniel Shao <EMAIL>, Faisal Mahmood <Faisal EMAIL>.
Pseudocode No The paper describes methods in prose and through result tables and figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Lastly, we provide a resource which standardizes the implementation of MIL models and provide model weights for FEATHER, a PC-108 pretrained ABMIL model at https://github.com/mahmoodlab/MIL-Lab.
Open Datasets Yes To this end, we exhaustively evaluate transfer performance of 11 MIL architectures across 19 publicly available benchmarks. We outline the evaluation protocol, pretraining and target datasets, and MIL architectures used for assessing MIL transfer below. In addition to the 19 tasks, we also include two pan-cancer tasks called PC-43 and PC-108 (Chen et al., 2024a). These represent 43-class and 108-class cancer subtyping tasks encompassing diverse malignancies from 17 organ types and are curated from the same hierarchical classification dataset for pretraining purposes only. EBRAINS (Roetzer-Pejrimovsky et al., 2022) NSCLC: The non-small cell lung carcinoma (NSCLC) subtyping task was a binary classification problem for distinguishing lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). The training data consisted of publicly available H&E WSIs from TCGA. PANDA (Bulten et al., 2022) BRACS (Brancati et al., 2021)
Dataset Splits Yes For evaluation, we assess MIL transfer performance on 19 publicly available CPath tasks, with training datasets ranging in size from 314 to 8,492 WSIs and in label complexity from 2 to 30 classes. EBRAINS (Roetzer-Pejrimovsky et al., 2022): We use label-stratified train/val/test splits (50% / 25% / 25%) provided by UNI (Chen et al., 2024a), with the same folds for both coarse- and fine-grained tasks. NSCLC: We performed an 80% / 10% / 10% train/val/test split on the TCGA dataset for training and internal validation. PANDA (Bulten et al., 2022): We use the same train/val/test folds (80% / 10% / 10%) as UNI. BRACS (Brancati et al., 2021): We use the official train/val/test folds (72% / 12% / 16%). Lung cancer biomarkers: each task is site- and label-stratified into approximate train/val/test splits (60% / 20% / 20%). Breast cancer biomarkers: each task is site- and label-stratified into approximate train/val/test splits (60% / 20% / 20%). Additionally, we evaluate on breast cancer core needle biopsies (BCNB, n = 1,058) (Xu et al., 2021) in a label-stratified train/test split (90% / 10%). GBMLGG mutational subtyping (Brennan et al., 2013; Roetzer-Pejrimovsky et al., 2022): We use the UNI splits, which label-stratify TCGA-GBMLGG into a train/val/test split with a 47:22:31 ratio.
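The label-stratified splitting described above can be sketched in pure Python. This is an illustrative reconstruction, not the authors' code: the 50/25/25 fractions follow the EBRAINS protocol, and the per-class rounding is an assumption.

```python
import random
from collections import defaultdict

def stratified_split(slide_ids, labels, fracs=(0.50, 0.25, 0.25), seed=0):
    """Label-stratified train/val/test split (illustrative sketch).

    Groups slides by class label, shuffles within each class, and
    allocates each class to the three splits by the given fractions,
    so class proportions are preserved across splits.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sid, y in zip(slide_ids, labels):
        by_class[y].append(sid)

    train, val, test = [], [], []
    for y, ids in sorted(by_class.items()):
        rng.shuffle(ids)
        n = len(ids)
        n_train = int(fracs[0] * n)
        n_val = int(fracs[1] * n)
        train.extend(ids[:n_train])
        val.extend(ids[n_train:n_train + n_val])
        test.extend(ids[n_train + n_val:])  # remainder absorbs rounding
    return train, val, test

# Toy usage: 100 hypothetical slides, two balanced classes.
tr, va, te = stratified_split([f"s{i}" for i in range(100)],
                              [i % 2 for i in range(100)])
```

Because the remainder goes to the test split, the split sizes match the target fractions only up to per-class rounding.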
Hardware Specification Yes Experiments were performed across four NVIDIA RTX A4000s, three NVIDIA GeForce RTX 2080 Ti GPUs, and three RTX 3090s, with a single GPU used per experiment.
Software Dependencies No The paper mentions "PyTorch's native implementation" but does not specify its version or any other key software libraries with their version numbers.
Experiment Setup Yes Unless specified otherwise, all MIL models are implemented using the authors' original model definitions, trained with UNI features, and with standardized hyperparameters: AdamW optimizer with a learning rate of 1e-4 and a cosine decay scheduler. For datasets with a validation set, we train for a maximum of 20 epochs (minimum of 10) with an early stopping patience of 5 epochs on the validation set. For datasets without a validation set, we train for 10 epochs. We use cross-entropy loss with random class-weighted sampling and a batch size of 1. For regularization, we use a weight decay of 1e-5, a dropout of 0.25 at every feedforward layer, and a dropout of 0.1 on the features from the pretrained encoder.
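The standardized hyperparameters above can be sketched as a minimal PyTorch configuration. This is an assumption-laden illustration, not the released code: the model here is a placeholder two-layer head (the 1024-d input matches UNI's ViT-L embedding size), while the optimizer, scheduler, loss, dropout rates, and batch size of 1 follow the quoted setup.

```python
import torch
import torch.nn as nn

# Placeholder MIL head standing in for an arbitrary MIL architecture.
model = nn.Sequential(
    nn.Dropout(0.1),        # dropout of 0.1 on pretrained encoder features
    nn.Linear(1024, 512),   # 1024-d input assumes UNI (ViT-L) embeddings
    nn.ReLU(),
    nn.Dropout(0.25),       # dropout of 0.25 at feedforward layers
    nn.Linear(512, 2),      # e.g. binary subtyping such as LUAD vs. LUSC
)

# AdamW with lr 1e-4 and weight decay 1e-5, cosine decay over 20 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)
criterion = nn.CrossEntropyLoss()

# One optimization step on a dummy bag (batch size 1, as in the paper).
bag = torch.randn(1, 1024)
logits = model(bag)
loss = criterion(logits, torch.tensor([0]))
loss.backward()
optimizer.step()
scheduler.step()  # advance cosine schedule by one epoch
```

Early stopping (patience 5, minimum 10 epochs) and class-weighted sampling would wrap this step inside an epoch loop, e.g. via `torch.utils.data.WeightedRandomSampler`.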