Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation
Authors: PeiYuan Tang, Xiaodong Zhang, Chunze Yang, Haoran Yuan, Jun Sun, Danfeng Shan, Zijiang James Yang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method, outperforming state-of-the-art methods by 3.3% on the average mIoU in synthetic-to-real domain generalization. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Technology, Xi'an Jiaotong University; 2 School of Computer Science and Technology, Xidian University; 3 Shaanxi Key Laboratory of Network and System Security, Xidian University; 4 Synkrotron, Inc.; 5 Singapore Management University; 6 University of Science and Technology of China |
| Pseudocode | No | The paper describes the method using textual explanations and network architecture diagrams (Figure 2, Figure 3), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/tpy001/VFMSeg |
| Open Datasets | Yes | Datasets. Following previous studies (Wei et al. 2024), we evaluate our method on both synthetic and real-world datasets. The synthetic dataset is GTAV (Richter et al. 2016), which contains 24,966 street-view images rendered by a computer game engine with the resolution of 1914x1052. For real-world datasets, we use Cityscapes (Cordts et al. 2016)... BDD100K (Yu et al. 2020) is another real-world dataset... The last real-world dataset we use is Mapillary (Neuhold et al. 2017)... |
| Dataset Splits | Yes | Cityscapes (Cordts et al. 2016), a large-scale semantic segmentation dataset for autonomous driving, with 2,975 training images and 500 validation images, all with a resolution of 2048x1024. BDD100K (Yu et al. 2020) is another real-world dataset that contains diverse urban driving scene images with the resolution of 1280x720. The last real-world dataset we use is Mapillary (Neuhold et al. 2017), which consists of high-resolution images with a minimum resolution of 1920x1080 collected from around the world. BDD100K and Mapillary provide 1,000 and 2,000 validation images, respectively. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper does not list software dependencies with version numbers; it states only: "Our implementation is based on the MMSegmentation framework." |
| Experiment Setup | Yes | Implementation Details. Our implementation is based on the MMSegmentation framework. We use the AdamW optimizer with learning rates of 1e-5 for the backbone and 1e-4 for all decode heads. Training is conducted for 40,000 iterations with a batch size of 2 and crop size of 512x512. We employ basic data augmentation techniques including random cropping, random horizontal flipping, photometric transformation and rare class sampling (Hoyer, Dai, and Van Gool 2022). During training, we set λ = 1.0, r = α = 32, and p = 0.2. During inference, we use a sliding window approach with a window size of 512x512 and a stride of 320. The θ and Cτ are set to 0.968 and 0.8 respectively. |
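The sliding-window inference quoted in the Experiment Setup row (512x512 window, stride 320) can be sketched as follows. This is a generic illustration of the technique, not the authors' MMSegmentation code; the names `sliding_windows`, `aggregate`, and `logit_fn` are hypothetical.

```python
import numpy as np

def sliding_windows(h, w, crop=512, stride=320):
    """Return (y0, x0, y1, x1) crops tiling an h x w image (assumes h, w >= crop)."""
    ys = list(range(0, h - crop + 1, stride))
    xs = list(range(0, w - crop + 1, stride))
    # Add a final window flush with the border so the whole image is covered.
    if ys[-1] + crop < h:
        ys.append(h - crop)
    if xs[-1] + crop < w:
        xs.append(w - crop)
    return [(y, x, y + crop, x + crop) for y in ys for x in xs]

def aggregate(logit_fn, image, num_classes, crop=512, stride=320):
    """Average per-class logits over overlapping windows.

    logit_fn: maps an (h, w, C) crop to a (num_classes, h, w) logit map
    (a stand-in for the segmentation network's forward pass).
    """
    h, w = image.shape[:2]
    logits = np.zeros((num_classes, h, w))
    counts = np.zeros((h, w))
    for y0, x0, y1, x1 in sliding_windows(h, w, crop, stride):
        logits[:, y0:y1, x0:x1] += logit_fn(image[y0:y1, x0:x1])
        counts[y0:y1, x0:x1] += 1
    return logits / counts  # every pixel is covered at least once
```

The final prediction is the argmax over the class axis of the averaged logits. For a 2048x1024 Cityscapes image this yields a 6x3 grid of overlapping windows, with overlap regions averaged rather than overwritten.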