Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions
Authors: Yan Ru Pei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that this heterogeneous network design outperforms its homogeneous counterparts in raw audio processing tasks including keyword spotting, speech denoising, and automatic speech recognition (ASR). For ASR, Centaurus is the first network with competitive performance that can be made fully state-space based, without using any nonlinear recurrence (LSTMs), explicit convolutions (CNNs), or (surrogate) attention mechanism. In our experiments, we study various configurations of Centaurus model architectures (e.g. classifier, hourglass, multi-streams) and block configurations (e.g. depthwise, bottleneck, full). This is to showcase the structural flexibility of our model and adaptation to different tasks, reminiscent of classical CNNs. |
| Researcher Affiliation | Industry | Yan Ru Pei, Brainchip Inc., Laguna Hills, CA 92653, USA |
| Pseudocode | Yes | See Listing 1 in Appendix C for a minimal pseudocode of the bottleneck block operations with the order of operations optimized, along with benchmarking on an A100 GPU. |
| Open Source Code | Yes | The model source code is at github.com/Brainchip-Inc/Centaurus. |
| Open Datasets | Yes | We begin with the simple task of keyword spotting (KWS) with raw waveforms, using the Google Speech Commands 35-class subset (Warden, 2018)... The performance of the network is evaluated on the Voice Bank + DEMAND (VB-DMD) testset... In Table 2, we report the performance and size of the base Centaurus model on LibriSpeech (Panayotov et al., 2015)... For the clean training samples, we use the processed VCTK and LibriVox datasets that can be downloaded from the Microsoft DNS4 challenge. We also use the noise training samples from the DNS4 challenge as well, which contains the AudioSet, Freesound, and DEMAND datasets. |
| Dataset Splits | Yes | The performance is evaluated on the SC35 testset... The performance is evaluated on the Voice Bank + DEMAND (VB-DMD) testset... In Table 2, we report the performance and size of the base Centaurus model on LibriSpeech test/dev-sets... The training dataset includes the LibriSpeech full 960h training set, and the Multilingual LibriSpeech (MLS) English training set. To guarantee no data leakage, we check every sample in the training set and remove it if it is sufficiently similar to any sample in the development and testing sets. |
| Hardware Specification | Yes | We perform the benchmark in fp32 precision on 1 A100 40GB SXM4 with PyTorch 2.5.1 under CUDA 12.4... perform training on a single NVIDIA A30 GPU with a batch size of 512... perform training on a single NVIDIA A30 GPU with a batch size of 192... perform training with 8 40GB A100 with a batch size of 32 per GPU. |
| Software Dependencies | Yes | We perform the benchmark in fp32 precision on 1 A100 40GB SXM4 with PyTorch 2.5.1 under CUDA 12.4... AdamW optimizer with the PyTorch default configs... Additionally, we trained with automatic mixed precision (AMP) along with torch.compile, except for the FFT convolution operations which are performed in full fp32 precision and without compilation (due to inability to handle complex data types currently)... We initialize the B and C projection matrices with the standard Kaiming uniform distribution. All of our training runs and trials are done with PyTorch with torch.compile enabled except for operations involving complex numbers (e.g. FFTs). In addition, we enabled TensorFloat-32 for matrix multiplications, and the opt_einsum backend for all torch.einsum operations. |
| Experiment Setup | Yes | Unless otherwise mentioned, all of our network variants are trained with: AdamW optimizer with the PyTorch default configs; a cosine decay scheduler with a linear warmup period equal to 0.01 of the total training steps, updating after every optimizer step; gradient clip value of 1; layer normalization (over the feature dimension) with elementwise affine parameters; SiLU activation; no dropout layers except for the keyword-spotting network. For all our trials in this experiment, we: train for 200 epochs; use a learning rate of 0.01 with a weight decay of 0.05; use a linear warmup period of 0.1 for the scheduler; apply Dropout1d with probability 0.1, only if the number of features is greater than 4; perform training on a single NVIDIA A30 GPU with a batch size of 512. |
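
The Experiment Setup row above describes a cosine decay schedule with a linear warmup equal to 0.01 of the total training steps, at a base learning rate of 0.01. A minimal sketch of that schedule in plain Python is shown below; the function name and its exact shape (warmup ramps to the base rate, then cosine decays toward zero) are illustrative assumptions, not the authors' implementation:

```python
import math

def lr_at_step(step, total_steps, base_lr=0.01, warmup_frac=0.01):
    """Sketch of a linear-warmup + cosine-decay schedule.

    base_lr=0.01 and warmup_frac=0.01 mirror the settings quoted in
    the table; everything else is an illustrative assumption.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # linear ramp from ~0 up to base_lr over the warmup period
        return base_lr * (step + 1) / warmup_steps
    # cosine decay from base_lr toward 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this kind of schedule is typically wired up with `torch.optim.lr_scheduler.LambdaLR` and stepped after every optimizer step, as the quoted setup indicates.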
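
The Research Type row quotes the paper's central claim: the network can be made fully state-space based, with no nonlinear recurrence or explicit convolutions, because a *linear* state-space layer can be evaluated either as a recurrence or as a causal convolution. A toy scalar-state sketch of that equivalence is below (this is a generic SSM identity for illustration, not the Centaurus block, which uses structured tensor contractions over multi-dimensional states):

```python
def ssm_recurrence(u, A, B, C):
    """Scalar linear SSM run step by step:
    x_t = A*x_{t-1} + B*u_t,  y_t = C*x_t  (x_0 state starts at 0)."""
    x, ys = 0.0, []
    for u_t in u:
        x = A * x + B * u_t
        ys.append(C * x)
    return ys

def ssm_convolution(u, A, B, C):
    """Same output computed as a causal convolution of u with the
    kernel k[j] = C * A**j * B, unrolled from the recurrence."""
    T = len(u)
    k = [C * (A ** j) * B for j in range(T)]
    return [sum(k[j] * u[t - j] for j in range(t + 1)) for t in range(T)]
```

Unrolling the recurrence gives y_t = sum_j C A^j B u_{t-j}, which is exactly the convolutional form; this is what lets such layers be trained with FFT-based convolutions (as in the Software Dependencies row) while still admitting a cheap recurrent mode at inference.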