Autoregressive Sequence Modeling for 3D Medical Image Representation

Authors: Siwen Wang, Churan Wang, Fei Gao, Lixian Su, Fandong Zhang, Yizhou Wang, Yizhou Yu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The effectiveness of our approach is demonstrated by the superior performance over others on nine downstream tasks in public datasets.
Researcher Affiliation Collaboration 1School of Computing and Data Science, The University of Hong Kong 2Center on Frontiers of Computing Studies, School of Computer Science, Nat'l Eng. Research Center of Visual Technology, Peking University 3Deepwise AI Lab 4State Key Lab of General Artificial Intelligence, Inst. for Artificial Intelligence, Peking University
Pseudocode No The paper describes the methodology using textual explanations and mathematical formulas, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No More details of our implementation are provided in the supplemental material.
Open Datasets Yes For individual images, we collect 23,287 3D CT and MRI volumes from 12 public medical image datasets (RibFrac (Jin et al. 2020), TCIA Covid19 (An et al. 2020), AMOS22 (Ji et al. 2022), ISLES2022 (Hernandez Petzsche et al. 2022), AbdomenCT-1K (Ma et al. 2021), TotalSegmentator (Wasserthal et al. 2023), VerSe 2020 (Sekuboyina et al. 2021), RSNA-2022-CSFD (Flanders et al. 2022), RSNA-2020-PED (Colak et al. 2021), STOIC (Revel et al. 2021), FLARE22 (Ma et al. 2023), and FLARE23 (Ma et al. 2024)). For multimodal images, we collect 2,995 multimodal MRI scans from BraTS 23 (La Bella et al. 2023), which is a series of challenges on brain MRI image analysis. Images belonging to the same semantic category are obtained from the DeepLesion dataset (Yan et al. 2018), which contains 10,594 CT scans of 4,427 patients. We conducted downstream experiments in nine clinical tasks on public medical image datasets to evaluate the effectiveness of our method. These datasets cover a variety of organs, lesions, and modalities, including Task03 Liver (131 cases), Task06 Lung (64 cases), Task07 Pancreas (282 cases), Task08 Hepatic Vessel (303 cases), Task09 Spleen (41 cases), and Task10 Colon (126 cases) from the Medical Segmentation Decathlon (MSD) (Antonelli et al. 2022), Left Atrium (LA) (Xiong et al. 2021) (100 cases), RICORD (Tsai et al. 2021) (330 cases), and LIDC-IDRI (Armato III et al. 2011) (1633 cases).
Dataset Splits Yes We randomly split the whole set into training, validation, and test at a ratio of 7:1:2 for the tasks on the MSD dataset. For the LA, RICORD, and LIDC-IDRI datasets, we follow the data split in (Yu et al. 2019), (Ye et al. 2024), and (Yang et al. 2023), respectively.
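The 7:1:2 random split described above can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the authors' code; the function name, seed, and use of case indices are assumptions.

```python
import random

def split_dataset(case_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Randomly split case IDs into train/val/test at the given ratios.

    A minimal sketch of the 7:1:2 split used for the MSD tasks; the
    seed and function signature are illustrative assumptions.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# e.g. the 131 cases of MSD Task03 Liver -> 91 / 13 / 27 cases
train, val, test = split_dataset(range(131))
```

Note that the LA, RICORD, and LIDC-IDRI splits are not random but follow the prior works cited above.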
Hardware Specification No The paper does not provide specific details about the hardware used to run its experiments, such as exact GPU or CPU models.
Software Dependencies No The paper mentions using the AdamW optimizer and cosine learning rate decay scheduler, and specific model architectures like ViT-B and UNETR, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes We use the AdamW optimizer and cosine learning rate decay scheduler for both pre-training and downstream tasks. In the pre-training stage, the initial learning rate is 1e-4, and we set 100K training steps with a batch size of 288. During the fine-tuning stage, the layer-wise learning rate decay strategy with a ratio of 0.75 is adopted to stabilize the ViT training. The evaluation metrics in classification tasks are the area under the receiver operating characteristic curve (AUC) and accuracy (ACC). For segmentation tasks, we use the Dice similarity coefficient as the evaluation metric.
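The two learning-rate schedules quoted above (cosine decay over 100K steps from 1e-4, and layer-wise decay with ratio 0.75 for fine-tuning) can be sketched as pure functions. This is a hedged reconstruction under common conventions, not the authors' implementation; in particular, the assumption of 12 transformer blocks (ViT-B depth) and the exact decay-to-zero cosine form are not stated in this excerpt.

```python
import math

BASE_LR = 1e-4        # initial learning rate from the paper
TOTAL_STEPS = 100_000  # pre-training steps from the paper
LAYER_DECAY = 0.75     # layer-wise decay ratio from the paper
NUM_LAYERS = 12        # assumption: ViT-B has 12 transformer blocks

def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Cosine decay from base_lr at step 0 to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def layerwise_lr(layer_idx, base_lr=BASE_LR, num_layers=NUM_LAYERS,
                 decay=LAYER_DECAY):
    """Layer-wise LR decay: later (deeper) layers keep a larger rate.

    layer_idx runs from 0 (embedding) to num_layers (final block); each
    step toward the input multiplies the rate by the decay ratio.
    """
    return base_lr * decay ** (num_layers - layer_idx)
```

In frameworks such as PyTorch, the same behavior is typically obtained by pairing `torch.optim.AdamW` with a cosine scheduler and assigning per-layer learning rates via optimizer parameter groups.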