Context-aware Dynamic Pruning for Speech Foundation Models

Authors: Masao Someki, Yifan Peng, Siddhant Arora, Markus Müller, Athanasios Mouchtaris, Grant Strimel, Jing Liu, Shinji Watanabe

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrated that we could successfully reduce inference time by approximately 30% while maintaining accuracy in multilingual/multi-task scenarios. We also found that the obtained pruned structure offers meaningful interpretations based on the context, e.g., task-related information emerging as the dominant factor for efficient pruning. (Section 4: Experiments)
Researcher Affiliation | Collaboration | Masao Someki, Yifan Peng, Siddhant Arora, Shinji Watanabe (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; EMAIL, EMAIL); Markus Müller, Thanasis Mouchtaris, Grant Strimel, Jing Liu (Neural Efficiency Science, Amazon Artificial General Intelligence, Pittsburgh, PA 15203, USA; EMAIL)
Pseudocode | Yes | Algorithm 1: Gate Predictor
    x_pooled = Average(x_GP)      # average over time dimension, x_pooled ∈ R^D
    x = Concat(C, x_pooled)       # concatenate conditional info, x ∈ R^(D+D_cond)
    logit = Reshape(G(x))         # reshape G(x) to logit ∈ R^(L×2)
    g = SGSE(logit, axis=1)       # compute SGSE function
    g = g[:, 1]                   # select second column from Gumbel-Softmax output
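The gate-predictor steps above can be sketched in NumPy. This is an illustrative re-implementation, not the paper's code: the function name `sgse_gate`, the module count `L`, and the single linear map `W` standing in for G(x) are assumptions, and SGSE is approximated here by a straight-through Gumbel-Softmax sample.

```python
import numpy as np

def sgse_gate(x_gp, cond, W, tau=0.1, L=4, rng=None):
    """Sketch of the gate predictor: pool features over time, concatenate
    conditional info, project to per-module logits, then draw hard gates
    with a straight-through Gumbel-Softmax (standing in for SGSE)."""
    rng = rng or np.random.default_rng(0)
    x_pooled = x_gp.mean(axis=0)           # average over time: (T, D) -> (D,)
    x = np.concatenate([cond, x_pooled])   # conditional info C prepended: (D + D_cond,)
    logit = (W @ x).reshape(L, 2)          # one (drop, keep) logit pair per module
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logit.shape)))
    z = (logit + gumbel) / tau             # Gumbel-Softmax relaxation, temperature tau
    z -= z.max(axis=1, keepdims=True)      # numerically stable softmax
    soft = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    hard = (soft == soft.max(axis=1, keepdims=True)).astype(float)  # straight-through forward
    return hard[:, 1]                      # second column: 1.0 = keep module

# Toy usage: T=5 frames, D=8 features, D_cond=3, L=4 prunable modules
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 11))           # (2L, D + D_cond)
gates = sgse_gate(rng.standard_normal((5, 8)), rng.standard_normal(3), W)
```

With hard sampling, each module's gate is exactly 0.0 or 1.0 in the forward pass, which is what makes skipping pruned modules at inference time possible.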
Open Source Code | Yes | You can download the OWSM-v3.1 model we employed in this experiment from the Hugging Face Hub.[1] All of our experiments are conducted with ESPnet (Watanabe et al., 2018). Based on the training configuration of OWSM-v3.1, we added or modified the following configuration.[2]
[1] https://huggingface.co/espnet/owsm_v3.1_ebf
[2] https://huggingface.co/espnet/owsm_v3.1_ebf/blob/main/exp/s2t_train_s2t_ebf_conv2d_size1024_e18_d18_piecewise_lr2e-4_warmup60k_flashattn_raw_bpe50000/config.yaml
Open Datasets | Yes | This study employs the Europarl-ST (Iranzo-Sánchez et al., 2020) dataset to evaluate model performance across multiple languages. The corpus was compiled from debates held in the European Parliament between 2008 and 2012. We utilized version 1.1 of the dataset, which comprises speech data in nine languages. For our experiments, we selected German, French, and Italian, which consist of approximately 20 hours of speech data. As the Europarl-ST dataset is not part of the OWSM training data, we deemed it suitable for evaluating the model under multi-lingual and multi-task settings. We developed a baseline based on Peng et al. (2023b) using a Transformer-encoder model trained on LibriSpeech (Panayotov et al., 2015) and fine-tuned it on the LibriSpeech-100h subset. We created a new dataset by integrating VoxForge (Voxforge.org) with Europarl-ST, covering the same languages.
Dataset Splits | Yes | We developed a baseline based on Peng et al. (2023b) using a Transformer-encoder model trained on LibriSpeech (Panayotov et al., 2015) and fine-tuned it on the LibriSpeech-100h subset. The reproduced WER results on the test-clean set are shown in Table 5.
Hardware Specification | Yes | We measured the actual inference time of the pruned model to analyze the effect of pruning on inference speed. We used one A40 GPU and 16 CPUs for each inference run.
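As a rough illustration of how such per-run latency figures are typically obtained (a generic sketch, not the paper's benchmarking code; for GPU inference one would additionally synchronize the device before reading the clock):

```python
import time

def mean_latency(run, n_warmup=3, n_runs=10):
    """Average wall-clock time of run() after warm-up iterations.
    Warm-up absorbs one-time costs (allocations, caching, lazy init)
    so the measured runs reflect steady-state inference speed."""
    for _ in range(n_warmup):
        run()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        run()
    return (time.perf_counter() - t0) / n_runs

# Toy usage with a stand-in workload in place of model inference
latency = mean_latency(lambda: sum(range(10_000)))
```

Comparing `mean_latency` of the pruned and unpruned models under identical hardware and batch settings is what a claim like "approximately 30% faster inference" rests on.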
Software Dependencies | No | Explanation: The paper mentions using ESPnet and the fvcore library but does not provide specific version numbers for these or other key software components such as PyTorch, Python, or CUDA. The provided configurations are for model training parameters, not software versions.
Experiment Setup | Yes | Based on the training configuration of OWSM-v3.1, we added or modified the following configuration:
    encoder: e_branchformer_token_condition
    decoder: transformer_decoder_token_condition
    tau_ini: 1
    tau_end: 0.1
    tau_cooldown_steps: 15000
    sparsity_init: 0.0
    sparsity_end: 0.3
    optim: adamw
    optim_conf:
      lr: 0.00001
      weight_decay: 0.000001
    scheduler: warmuplr
    scheduler_conf:
      warmup_steps: 6000
In all experiments, we performed auto-regressive decoding with a beam size of 5.
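The `tau_*` and `sparsity_*` entries describe annealing schedules: the Gumbel-Softmax temperature is cooled from 1 to 0.1 over 15,000 steps while the sparsity target ramps from 0.0 to 0.3. A minimal sketch, assuming a linear schedule (the configuration excerpt does not state the interpolation shape):

```python
def anneal(step, start, end, total_steps):
    """Linearly interpolate from start to end over total_steps, then hold."""
    if step >= total_steps:
        return end
    return start + (end - start) * step / total_steps

# tau_ini=1, tau_end=0.1, tau_cooldown_steps=15000
tau_mid = anneal(7500, 1.0, 0.1, 15000)          # temperature midway through cooldown
# sparsity_init=0.0, sparsity_end=0.3 (assumed to share the same step horizon)
sparsity_final = anneal(15000, 0.0, 0.3, 15000)  # target reached at end of cooldown
```

Cooling the temperature sharpens the Gumbel-Softmax toward hard 0/1 gates, while ramping the sparsity target lets the model adapt gradually instead of pruning 30% of modules from the first step.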