Context-aware Dynamic Pruning for Speech Foundation Models

Authors: Masao Someki, Yifan Peng, Siddhant Arora, Markus Müller, Athanasios Mouchtaris, Grant Strimel, Jing Liu, Shinji Watanabe

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrated that we could successfully reduce inference time by approximately 30% while maintaining accuracy in multilingual/multi-task scenarios. We also found that the obtained pruned structure offers meaningful interpretations based on the context, e.g., task-related information emerging as the dominant factor for efficient pruning. (Section 4: Experiments)
Researcher Affiliation | Collaboration | Masao Someki, Yifan Peng, Siddhant Arora, Shinji Watanabe (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; EMAIL, EMAIL); Markus Müller, Thanasis Mouchtaris, Grant Strimel, Jing Liu (Neural Efficiency Science, Amazon Artificial General Intelligence, Pittsburgh, PA 15203, USA; EMAIL)
Pseudocode | Yes | Algorithm 1: Gate Predictor
    x_pooled = Average(x_GP)      # average over time dimension, x_pooled ∈ R^D
    x = Concat(C, x_pooled)       # concatenate conditional info, x ∈ R^(D+D_cond)
    logit = Reshape(G(x))         # reshape G(x) to logit ∈ R^(L×2)
    g = SGSE(logit, axis=1)       # compute SGSE function
    g = g[:, 1]                   # select second column from Gumbel-Softmax output
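The gate-predictor steps above can be sketched in NumPy. This is an illustrative re-implementation, not the paper's code: the function name `sgse_gate`, the module count `L`, and the single linear map `W` standing in for G(x) are assumptions, and SGSE is approximated here by a straight-through Gumbel-Softmax sample.

```python
import numpy as np

def sgse_gate(x_gp, cond, W, tau=0.1, L=4, rng=None):
    """Sketch of the gate predictor: pool features over time, concatenate
    conditional info, project to per-module logits, then draw hard gates
    with a straight-through Gumbel-Softmax (standing in for SGSE)."""
    rng = rng or np.random.default_rng(0)
    x_pooled = x_gp.mean(axis=0)           # average over time: (T, D) -> (D,)
    x = np.concatenate([cond, x_pooled])   # conditional info C prepended: (D + D_cond,)
    logit = (W @ x).reshape(L, 2)          # one (drop, keep) logit pair per module
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logit.shape)))
    z = (logit + gumbel) / tau             # Gumbel-Softmax relaxation, temperature tau
    z -= z.max(axis=1, keepdims=True)      # numerically stable softmax
    soft = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    hard = (soft == soft.max(axis=1, keepdims=True)).astype(float)  # straight-through forward
    return hard[:, 1]                      # second column: 1.0 = keep module

# Toy usage: T=5 frames, D=8 features, D_cond=3, L=4 prunable modules
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 11))           # (2L, D + D_cond)
gates = sgse_gate(rng.standard_normal((5, 8)), rng.standard_normal(3), W)
```

With hard sampling, each module's gate is exactly 0.0 or 1.0 in the forward pass, which is what makes skipping pruned modules at inference time possible.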
Open Source Code | Yes | You can download the OWSM-v3.1 model we employed in this experiment from the Hugging Face Hub.[1] All of our experiments are conducted with ESPnet (Watanabe et al., 2018). Based on the training configuration of OWSM-v3.1, we added or modified the following configuration.[2]
[1] https://huggingface.co/espnet/owsm_v3.1_ebf
[2] https://huggingface.co/espnet/owsm_v3.1_ebf/blob/main/exp/s2t_train_s2t_ebf_conv2d_size1024_e18_d18_piecewise_lr2e-4_warmup60k_flashattn_raw_bpe50000/config.yaml
Open Datasets | Yes | This study employs the Europarl-ST (Iranzo-Sánchez et al., 2020) dataset to evaluate model performance across multiple languages. The corpus was compiled from debates held in the European Parliament between 2008 and 2012. We utilized version 1.1 of the dataset, which comprises speech data in nine languages. For our experiments, we selected German, French, and Italian, which consist of approximately 20 hours of speech data. As the Europarl-ST dataset is not part of the OWSM training data, we deemed it suitable for evaluating the model under multi-lingual and multi-task settings. We developed a baseline based on Peng et al. (2023b) using a Transformer-encoder model trained on LibriSpeech (Panayotov et al., 2015) and fine-tuned it on the LibriSpeech-100h subset. We created a new dataset by integrating VoxForge (Voxforge.org) with Europarl-ST, covering the same languages.
Dataset Splits | Yes | We developed a baseline based on Peng et al. (2023b) using a Transformer-encoder model trained on LibriSpeech (Panayotov et al., 2015) and fine-tuned it on the LibriSpeech-100h subset. The reproduced WER results on the test-clean set are shown in Table 5.
Hardware Specification | Yes | We measured the actual inference time of the pruned model to analyze the effect of pruning on inference speed. We used one A40 GPU and 16 CPUs for each inference run.
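As a rough illustration of how such per-run latency figures are typically obtained (a generic sketch, not the paper's benchmarking code; for GPU inference one would additionally synchronize the device before reading the clock):

```python
import time

def mean_latency(run, n_warmup=3, n_runs=10):
    """Average wall-clock time of run() after warm-up iterations.
    Warm-up absorbs one-time costs (allocations, caching, lazy init)
    so the measured runs reflect steady-state inference speed."""
    for _ in range(n_warmup):
        run()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        run()
    return (time.perf_counter() - t0) / n_runs

# Toy usage with a stand-in workload in place of model inference
latency = mean_latency(lambda: sum(range(10_000)))
```

Comparing `mean_latency` of the pruned and unpruned models under identical hardware and batch settings is what a claim like "approximately 30% faster inference" rests on.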
Software Dependencies | No | Explanation: The paper mentions using ESPnet and the fvcore library but does not provide specific version numbers for these or other key software components such as PyTorch, Python, or CUDA. The provided configurations are for model training parameters, not software versions.
Experiment Setup | Yes | Based on the training configuration of OWSM-v3.1, we added or modified the following configuration:
    encoder: e_branchformer_token_condition
    decoder: transformer_decoder_token_condition
    tau_ini: 1
    tau_end: 0.1
    tau_cooldown_steps: 15000
    sparsity_init: 0.0
    sparsity_end: 0.3
    optim: adamw
    optim_conf:
      lr: 0.00001
      weight_decay: 0.000001
    scheduler: warmuplr
    scheduler_conf:
      warmup_steps: 6000
In all experiments, we performed auto-regressive decoding with a beam size of 5.
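The `tau_*` and `sparsity_*` entries describe annealing schedules: the Gumbel-Softmax temperature is cooled from 1 to 0.1 over 15,000 steps while the sparsity target ramps from 0.0 to 0.3. A minimal sketch, assuming a linear schedule (the configuration excerpt does not state the interpolation shape):

```python
def anneal(step, start, end, total_steps):
    """Linearly interpolate from start to end over total_steps, then hold."""
    if step >= total_steps:
        return end
    return start + (end - start) * step / total_steps

# tau_ini=1, tau_end=0.1, tau_cooldown_steps=15000
tau_mid = anneal(7500, 1.0, 0.1, 15000)          # temperature midway through cooldown
# sparsity_init=0.0, sparsity_end=0.3 (assumed to share the same step horizon)
sparsity_final = anneal(15000, 0.0, 0.3, 15000)  # target reached at end of cooldown
```

Cooling the temperature sharpens the Gumbel-Softmax toward hard 0/1 gates, while ramping the sparsity target lets the model adapt gradually instead of pruning 30% of modules from the first step.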