Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions

Authors: Yan Ru Pei

ICLR 2025

Reproducibility assessment (variable, result, supporting excerpts):
Research Type: Experimental. "We show that this heterogeneous network design outperforms its homogeneous counterparts in raw audio processing tasks including keyword spotting, speech denoising, and automatic speech recognition (ASR). For ASR, Centaurus is the first network with competitive performance that can be made fully state-space based, without using any nonlinear recurrence (LSTMs), explicit convolutions (CNNs), or (surrogate) attention mechanisms. In our experiments, we study various configurations of Centaurus model architectures (e.g. classifier, hourglass, multi-stream) and block configurations (e.g. depthwise, bottleneck, full) to showcase the structural flexibility of our model and its adaptation to different tasks, reminiscent of classical CNNs."
Researcher Affiliation: Industry. "Yan Ru Pei, Brainchip Inc., Laguna Hills, CA 92653, USA, EMAIL"
Pseudocode: Yes. "See Listing 1 in Appendix C for a minimal pseudocode of the bottleneck block operations with the order of operations optimized, along with benchmarking on an A100 GPU."
Open Source Code: Yes. "The model source code is at github.com/Brainchip-Inc/Centaurus."
Open Datasets: Yes. "We begin with the simple task of keyword spotting (KWS) with raw waveforms, using the Google Speech Commands 35-class subset (Warden, 2018)... The performance of the network is evaluated on the Voice Bank + DEMAND (VB-DMD) test set... In Table 2, we report the performance and size of the base Centaurus model on LibriSpeech (Panayotov et al., 2015)... For the clean training samples, we use the processed VCTK and LibriVox datasets that can be downloaded from the Microsoft DNS4 challenge. We also use the noise training samples from the DNS4 challenge, which contains the AudioSet, Freesound, and DEMAND datasets."
Dataset Splits: Yes. "The performance is evaluated on the SC35 test set... The performance is evaluated on the Voice Bank + DEMAND (VB-DMD) test set... In Table 2, we report the performance and size of the base Centaurus model on LibriSpeech test/dev sets... The training dataset includes the LibriSpeech full 960h training set, and the Multilingual LibriSpeech (MLS) English training set. To guarantee no data leakage, we check every sample in the training set and remove it if it is sufficiently similar to any sample in the development and testing sets."
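The quoted leakage check ("remove it if it is sufficiently similar") does not specify a similarity criterion. A minimal sketch of one plausible interpretation, comparing normalized transcripts with a ratio threshold (the normalization scheme and the 0.9 threshold are assumptions, not from the paper):

```python
import difflib

def normalize(text: str) -> str:
    # Lowercase and keep only alphanumerics/spaces so trivial formatting
    # differences do not hide a duplicate transcript.
    return " ".join(
        "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()
    )

def is_leaked(train_transcript: str, eval_transcripts: list[str],
              threshold: float = 0.9) -> bool:
    # Flag a training sample whose transcript closely matches any
    # dev/test transcript; the threshold is an illustrative assumption.
    t = normalize(train_transcript)
    return any(
        difflib.SequenceMatcher(None, t, normalize(e)).ratio() >= threshold
        for e in eval_transcripts
    )

train = ["Hello world, how are you?", "a completely different utterance"]
eval_set = ["hello world how are you"]
kept = [s for s in train if not is_leaked(s, eval_set)]
```

Here the first training sample is dropped because, after normalization, it is identical to a held-out transcript.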
Hardware Specification: Yes. "We perform the benchmark in fp32 precision on 1 A100 40GB SXM4 with PyTorch 2.5.1 under CUDA 12.4... perform training on a single NVIDIA A30 GPU with a batch size of 512... perform training on a single NVIDIA A30 GPU with a batch size of 192... perform training with 8 40GB A100 GPUs with a batch size of 32 per GPU."
Software Dependencies: Yes. "We perform the benchmark in fp32 precision on 1 A100 40GB SXM4 with PyTorch 2.5.1 under CUDA 12.4... AdamW optimizer with the PyTorch default configs... Additionally, we trained with automatic mixed precision (AMP) along with torch.compile, except for the FFT convolution operations, which are performed in full fp32 precision and without compilation (due to the current inability to handle complex data types)... We initialize the B and C projection matrices with the standard Kaiming uniform distribution. All of our training runs and trials are done with PyTorch with torch.compile enabled, except for operations involving complex numbers (e.g. FFTs). In addition, we enabled TensorFloat-32 for matrix multiplications, and the opt_einsum backend for all torch.einsum operations."
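The runtime settings quoted above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released code: the FFT-convolution helper is a placeholder showing why complex dtypes from rfft/irfft are kept out of torch.compile, and the tiny compiled module is purely for demonstration.

```python
import torch
import torch.nn as nn

# TensorFloat-32 for matrix multiplications, as reported.
torch.backends.cuda.matmul.allow_tf32 = True
# opt_einsum contraction planning for all torch.einsum operations.
torch.backends.opt_einsum.enabled = True

def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Illustrative FFT-based linear convolution. This is the kind of op
    # the paper reports running uncompiled and in full fp32, since
    # torch.compile currently has limited complex-dtype support.
    n = u.shape[-1] + k.shape[-1] - 1
    return torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(k, n=n), n=n)

# Everything else goes through torch.compile (placeholder module).
compiled_block = torch.compile(nn.Sequential(nn.Linear(16, 16), nn.SiLU()))
```

In practice the compiled and uncompiled paths are composed inside each block, with AMP wrapping the compiled portions.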
Experiment Setup: Yes. "Unless otherwise mentioned, all of our network variants are trained with: AdamW optimizer with the PyTorch default configs; a cosine decay scheduler with a linear warmup period equal to 0.01 of the total training steps, updating after every optimizer step; gradient clip value of 1; layer normalization (over the feature dimension) with elementwise affine parameters; SiLU activation; no dropout layers except for the keyword-spotting network. For all our trials in this experiment, we train for 200 epochs; use a learning rate of 0.01 with a weight decay of 0.05; use a linear warmup period of 0.1 for the scheduler; Dropout1d with probability 0.1, only applied if the number of features is greater than 4; perform training on a single NVIDIA A30 GPU with a batch size of 512."
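The stated recipe (AdamW with PyTorch defaults, cosine decay with a linear warmup over 1% of total steps stepped after every optimizer step, gradient clipping at 1) can be sketched as below. The tiny model, step counts, and the choice of norm-based clipping and SequentialLR composition are assumptions for illustration; the paper's "gradient clip value of 1" could equally mean clip_grad_value_.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.SiLU(), nn.Linear(8, 1))  # placeholder
total_steps = 1000
warmup_steps = max(1, int(0.01 * total_steps))  # warmup = 1% of total steps

opt = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.05)
warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1e-3, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=total_steps - warmup_steps)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, [warmup, cosine], milestones=[warmup_steps])

for step in range(3):  # a few dummy steps for illustration
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    # "gradient clip value of 1"; norm clipping assumed here.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()  # scheduler updates after every optimizer step
```

The SequentialLR milestone hands off from linear warmup to cosine decay exactly at the end of the warmup period.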