MaskViM: Domain Generalized Semantic Segmentation with State Space Models

Authors: Jiahao Li, Yang Lu, Yuan Xie, Yanyun Qu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental "Our method achieves superior performance on four diverse DGSS settings, which demonstrates the effectiveness of our method." The paper reports experiments including a system-level efficiency comparison with UperNet (Xiao et al. 2018) in Tab. 1, a performance comparison in the Cityscapes-to-Cityscapes-C setting in Tab. 2, and ablation studies on the mask in the encoder in the first three rows of Tab. 6.
Researcher Affiliation Academia 1School of Informatics, Xiamen University 2Institute of Artificial Intelligence, Xiamen University 3Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University 4School of Computer Science and Technology, East China Normal University 5Chongqing Institute of East China Normal University
Pseudocode No The paper describes the methodology using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an explicit statement about releasing source code for their methodology, nor does it provide a link to a code repository. It mentions 'using the code from Mamba (Gu and Dao 2023)' for FLOPs computation, but this refers to a third-party tool, not their own implementation.
Open Datasets Yes "Our method achieves superior performance on four diverse DGSS settings, which demonstrates the effectiveness of our method." In the field of visual representation learning, Vim (Zhu et al. 2024) proposes a novel generic vision backbone featuring bidirectional Mamba blocks, which mark image sequences with positional embeddings and compress visual representations using bidirectional state space models. The real and synthetic datasets are sourced from the Mapillary and SYNTHIA validation sets, respectively. Corruptions, such as noise and fog, are introduced to the Cityscapes validation set. Lee et al. (Lee et al. 2022) utilize ImageNet data (Deng et al. 2009) as the auxiliary domain to conduct style transfer at the feature level.
Dataset Splits Yes The mIoU is calculated on the BDD validation set via models trained on the Cityscapes training set. Tab. 2 presents the performance comparison in the Cityscapes-to-Cityscapes-C setting. Tab. 3 presents the performance comparison in the Cityscapes-to-others setting. Tab. 4 illustrates the performance comparison in the Mapillary-to-others setting. Tab. 5 depicts the performance comparison in the GTAV-to-others setting.
Hardware Specification No The paper does not provide specific details about the hardware used for training or inference, such as GPU or CPU models. It only mentions that "FLOPs are computed for inputs of size 512x512, using the code from Mamba (Gu and Dao 2023)" which is a measurement context, not experimental setup.
Software Dependencies No The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). It mentions using 'the code from Mamba (Gu and Dao 2023)' but not its specific version or any other library versions used in their implementation.
Experiment Setup Yes Considering that masking too many tokens may lead to performance degradation and that incorrect mask positioning can also impair performance, we devise a mask loss, i.e., $L_m = \alpha_1 L^r_m + \alpha_2 L^p_m$, where $\alpha_1$ and $\alpha_2$ are weighting factors, to regulate both the ratio and the position of the mask:
$$L^r_m = \sum_{s} \Big(\lambda - \tfrac{1}{hw} \sum_{i,j} m^s_{i,j}\Big)^2, \qquad L^p_m = \sum_{s} \sum_{i,j} \big(M_{i,j} - m^s_{i,j}\big)^2 \tag{13}$$
The coefficient $\lambda \in (0, 1)$ modulates the mask ratio. The results indicate that a uniform $\lambda$ for all Stages is better than an individual one for each Stage; finally, a uniform $\lambda = 0.5$ is employed to regulate the ratio across all Stages. The optimal weighting strategy occurs when $\alpha_1 = \alpha_2 = 1.0$.
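The quoted mask loss can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the per-stage soft masks `masks`, the target position map `M`, and the exact form of the position term are assumptions based on Eq. (13) as quoted above, with λ = 0.5 and α1 = α2 = 1.0 as reported.

```python
import numpy as np

def mask_loss(masks, M, lam=0.5, alpha1=1.0, alpha2=1.0):
    """Hypothetical sketch of Lm = a1 * Lr + a2 * Lp from Eq. (13).

    masks: list of per-stage soft masks m^s, each an (h, w) array in [0, 1]
    M:     (h, w) target mask-position map
    lam:   target mask ratio (uniform across stages, 0.5 in the paper)
    """
    # Ratio term: squared gap between the target ratio and the mean mask value
    Lr = sum((lam - m.mean()) ** 2 for m in masks)
    # Position term: squared deviation of each stage's mask from the target map
    Lp = sum(((M - m) ** 2).sum() for m in masks)
    return alpha1 * Lr + alpha2 * Lp
```

When every stage's mask already matches both the target ratio and the target positions, both terms vanish and the loss is zero.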