Autoregressive Sequence Modeling for 3D Medical Image Representation

Authors: Siwen Wang, Churan Wang, Fei Gao, Lixian Su, Fandong Zhang, Yizhou Wang, Yizhou Yu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The effectiveness of our approach is demonstrated by the superior performance over others on nine downstream tasks in public datasets.
Researcher Affiliation Collaboration 1School of Computing and Data Science, The University of Hong Kong 2Center on Frontiers of Computing Studies, School of Computer Science, Nat'l Eng. Research Center of Visual Technology, Peking University 3Deepwise AI Lab 4State Key Lab of General Artificial Intelligence, Inst. for Artificial Intelligence, Peking University
Pseudocode No The paper describes the methodology using textual explanations and mathematical formulas, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No More details of our implementation are provided in the supplemental material.
Open Datasets Yes For individual images, we collect 23,287 3D CT and MRI volumes from 12 public medical image datasets (RibFrac (Jin et al. 2020), TCIA Covid19 (An et al. 2020), AMOS22 (Ji et al. 2022), ISLES2022 (Hernandez Petzsche et al. 2022), AbdomenCT-1K (Ma et al. 2021), TotalSegmentator (Wasserthal et al. 2023), VerSe 2020 (Sekuboyina et al. 2021), RSNA-2022-CSFD (Flanders et al. 2022), RSNA-2020-PED (Colak et al. 2021), STOIC (Revel et al. 2021), FLARE22 (Ma et al. 2023), and FLARE23 (Ma et al. 2024)). For multimodal images, we collect 2,995 multimodal MRI scans from BraTS 23 (La Bella et al. 2023), which is a series of challenges on brain MRI image analysis. Images belonging to the same semantic category are obtained from the DeepLesion dataset (Yan et al. 2018), which contains 10,594 CT scans of 4,427 patients. We conducted downstream experiments in nine clinical tasks on public medical image datasets to evaluate the effectiveness of our method. These datasets cover a variety of organs, lesions, and modalities, including Task03 Liver (131 cases), Task06 Lung (64 cases), Task07 Pancreas (282 cases), Task08 Hepatic Vessel (303 cases), Task09 Spleen (41 cases), and Task10 Colon (126 cases) from the Medical Segmentation Decathlon (MSD) (Antonelli et al. 2022), Left Atrium (LA) (Xiong et al. 2021) (100 cases), RICORD (Tsai et al. 2021) (330 cases), and LIDC-IDRI (Armato III et al. 2011) (1633 cases).
Dataset Splits Yes We randomly split the whole set into training, validation, and test at a ratio of 7:1:2 for the tasks on the MSD dataset. For the LA, RICORD, and LIDC-IDRI datasets, we follow the data split in (Yu et al. 2019), (Ye et al. 2024), and (Yang et al. 2023), respectively.
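The 7:1:2 random split described above can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the authors' code; the function name, seed, and use of case indices are assumptions.

```python
import random

def split_dataset(case_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Randomly split case IDs into train/val/test at the given ratios.

    A minimal sketch of the 7:1:2 split used for the MSD tasks; the
    seed and function signature are illustrative assumptions.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# e.g. the 131 cases of MSD Task03 Liver -> 91 / 13 / 27 cases
train, val, test = split_dataset(range(131))
```

Note that the LA, RICORD, and LIDC-IDRI splits are not random but follow the prior works cited above.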
Hardware Specification No The paper does not provide specific details about the hardware used to run its experiments, such as exact GPU or CPU models.
Software Dependencies No The paper mentions using the AdamW optimizer and cosine learning rate decay scheduler, and specific model architectures like ViT-B and UNETR, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes We use the AdamW optimizer and cosine learning rate decay scheduler for both pre-training and downstream tasks. In the pre-training stage, the initial learning rate is 1e-4, and we set 100K training steps with a batch size of 288. During the fine-tuning stage, the layer-wise learning rate decay strategy with a ratio of 0.75 is adopted to stabilize the ViT training. The evaluation metrics in classification tasks are the area under the receiver operating characteristic curve (AUC) and accuracy (ACC). For segmentation tasks, we use the Dice similarity coefficient as the evaluation metric.
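The two learning-rate schedules quoted above (cosine decay over 100K steps from 1e-4, and layer-wise decay with ratio 0.75 for fine-tuning) can be sketched as pure functions. This is a hedged reconstruction under common conventions, not the authors' implementation; in particular, the assumption of 12 transformer blocks (ViT-B depth) and the exact decay-to-zero cosine form are not stated in this excerpt.

```python
import math

BASE_LR = 1e-4        # initial learning rate from the paper
TOTAL_STEPS = 100_000  # pre-training steps from the paper
LAYER_DECAY = 0.75     # layer-wise decay ratio from the paper
NUM_LAYERS = 12        # assumption: ViT-B has 12 transformer blocks

def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Cosine decay from base_lr at step 0 to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def layerwise_lr(layer_idx, base_lr=BASE_LR, num_layers=NUM_LAYERS,
                 decay=LAYER_DECAY):
    """Layer-wise LR decay: later (deeper) layers keep a larger rate.

    layer_idx runs from 0 (embedding) to num_layers (final block); each
    step toward the input multiplies the rate by the decay ratio.
    """
    return base_lr * decay ** (num_layers - layer_idx)
```

In frameworks such as PyTorch, the same behavior is typically obtained by pairing `torch.optim.AdamW` with a cosine scheduler and assigning per-layer learning rates via optimizer parameter groups.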