MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations

Authors: Hongyu Ke, Jack Morris, Kentaro Oguchi, Xiaofei Cao, Yongkang Liu, Haoxin Wang, Yi Ding

ICLR 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental — Extensive experiments demonstrate MamBEV's promising performance across diverse visual perception metrics, highlighting its advantages in input scaling efficiency compared to existing benchmark models. A thorough set of ablation studies is provided to showcase model scaling and other properties. We open-source our code and provide a strong baseline and evaluation framework for future experimentation.
Researcher Affiliation — Collaboration — ¹Georgia State University; ²InfoTech Labs, Toyota Motor North America R&D
Pseudocode — Yes — A.3 ALGORITHMS: The pseudocode of our proposed Spatial Cross Mamba is shown in Algorithm 1. The details of the Cross Quasi-Separable State Space Model (XQSSM) are shown in Algorithm 2.
Open Source Code — Yes — The code is available at https://github.com/amai-gsu/MamBEV. We open-source our code and provide a strong baseline and evaluation framework for future experimentation.
Open Datasets — Yes — We conduct our experiments using the nuScenes dataset (Caesar et al., 2020). The nuScenes dataset is a large-scale autonomous driving dataset containing 1000 driving scenes from Boston and Singapore.
Dataset Splits — No — The paper mentions using the nuScenes dataset but does not explicitly describe how the data was split into training, validation, or test sets for the experiments (e.g., specific percentages, counts, or a reference to the predefined splits used by the authors).
Hardware Specification — Yes — We trained with an effective batch size of 32 with no gradient accumulation on 8 A100s for 30 epochs, truncated at 24 epochs. The FPS is the average number of samples per second processed by the model in evaluation mode on an RTX 4090 GPU.
Software Dependencies — No — The paper mentions using an AdamW optimizer and an automatic mixed-precision optimizer wrapper, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup — Yes — We used a learning rate of 8×10⁻⁴, with a linear warmup for 10% of the scheduled steps starting from (8/3)×10⁻⁴. Following the warmup, the learning rate follows an epoch-based cosine annealing schedule with a minimum learning rate of 8×10⁻⁷. We trained with an effective batch size of 32 with no gradient accumulation on 8 A100s for 30 epochs, truncated at 24 epochs. Starting from step 100, an exponential moving average w̄_t = (1 − 0.0002)·w̄_{t−1} + 0.0002·w_t is applied to all weights. An AdamW optimizer with a 0.01 weight decay is used, and training employs an automatic mixed-precision optimizer wrapper with an initial gradient scaling of 512. A 0.1 multiplier is applied to the learning rate of the backbone weights and the deformable attention sampling offsets (Zhu et al., 2020).
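The learning-rate schedule and weight EMA quoted above can be sketched as standalone functions. This is a minimal sketch under stated assumptions: the cosine phase is written per-step rather than strictly per-epoch, the warmup start is read as (8/3)×10⁻⁴, and the names `lr_at` and `ema_update` are illustrative, not from the paper's codebase.

```python
import math

BASE_LR = 8e-4          # peak learning rate (quoted: 8 x 10^-4)
WARMUP_START = 8e-4 / 3 # warmup start, read as (8/3) x 10^-4
MIN_LR = 8e-7           # cosine-annealing floor (quoted: 8 x 10^-7)
EMA_DECAY = 0.0002      # EMA coefficient, applied from step 100 onward

def lr_at(step, total_steps, warmup_frac=0.1):
    """Linear warmup over the first 10% of steps, then cosine annealing."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from WARMUP_START up to BASE_LR.
        t = step / max(warmup_steps, 1)
        return WARMUP_START + t * (BASE_LR - WARMUP_START)
    # Cosine decay from BASE_LR down to MIN_LR over the remaining steps.
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * t))

def ema_update(ema_w, w):
    """One EMA step: w_ema <- (1 - 0.0002) * w_ema + 0.0002 * w."""
    return (1 - EMA_DECAY) * ema_w + EMA_DECAY * w
```

In a real training loop these would correspond to a warmup-plus-cosine scheduler attached to the AdamW optimizer and an EMA hook over the model parameters, with the 0.1 multiplier realized as a separate parameter group for the backbone and sampling-offset weights.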