Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding
Authors: Zhongyi Shui, Jianpeng Zhang, Weiwei Cao, Sinuo Wang, Ruizhe Guo, Le Lu, Lin Yang, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients, and conducted a comprehensive evaluation of 54 major and important disease diagnosis tasks (including several of the most deadly cancers) across 15 main anatomies. Experimental results demonstrate the substantial potential of fVLM in versatile medical image interpretation. In the zero-shot classification task, we achieved an average AUC of 81.3% on 54 diagnosis tasks, surpassing CLIP and supervised methods by 12.9% and 8.0%, respectively. Additionally, on the publicly available CT-RATE and Rad-ChestCT benchmarks, our fVLM outperformed the current state-of-the-art methods with absolute AUC gains of 7.4% and 4.8%, respectively. |
| Researcher Affiliation | Collaboration | 1. DAMO Academy, Alibaba Group; 2. The First Affiliated Hospital of College of Medicine, Zhejiang University, China; 3. Zhejiang University, China; 4. Westlake University, China; 5. Hupan Lab, 310023, China |
| Pseudocode | No | The paper describes the methodology using figures and text, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/alibaba-damo-academy/fvlm |
| Open Datasets | Yes | Moreover, on the publicly available CT-RATE and Rad-ChestCT datasets, our fVLM outperforms the state-of-the-art approach by 7.4% and 4.8% absolute AUC value gains, respectively. The details regarding these two datasets can be found in Hamamci et al. (2024) and Draelos et al. (2021). |
| Dataset Splits | Yes | We randomly split the dataset into training, validation and test sets of 64,476, 1,151, and 3,459 patients, respectively. |
| Hardware Specification | No | The paper states: 'All experiments are conducted on 8 NVIDIA A100 GPUs' in Appendix A.2, but it does not specify other hardware details such as CPU model, memory, or clock speeds. |
| Software Dependencies | No | The paper mentions specific models and tools like "vision transformer (ViT)", "BERT", "TotalSegmentator", and "Qwen 2.5", and specifies the "AdamW optimizer". However, it does not provide specific version numbers for general software libraries or programming languages (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For MedVL-CT69K, the encoder for fVLM is initialized with an R-50 vision transformer (ViT) pre-trained on ImageNet. We train the vision-language model for 100 epochs, using a batch size of 256. The learning rate is initialized to 1e-4 and is decayed by a factor of 0.1 at 60 and 90 epochs. We use the AdamW optimizer with a weight decay of 0.05. For fine-tuning, the learning rate is set to 2e-5. |
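The reported pre-training schedule (initial LR 1e-4, decayed by 0.1 at epochs 60 and 90 over 100 epochs) can be sketched as a plain step-decay function. This is a minimal illustration, not code from the released repository; the helper name `learning_rate_at` and the standalone-function form are assumptions for clarity.

```python
def learning_rate_at(epoch, base_lr=1e-4, milestones=(60, 90), gamma=0.1):
    """Step-decay schedule as reported in the paper's setup (hypothetical helper):
    start at base_lr and multiply by gamma at each milestone epoch reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Example values over the 100-epoch run:
# epochs 0-59 -> 1e-4, epochs 60-89 -> 1e-5, epochs 90-99 -> 1e-6
```

In a PyTorch training loop this corresponds to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 90], gamma=0.1)` wrapped around an `AdamW` optimizer with `weight_decay=0.05`.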