Long-Form Speech Generation with Spoken Language Models

Authors: Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, Rj Skerry-Ryan

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to the high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.
Researcher Affiliation: Collaboration. 1Google DeepMind. 2Integrated Vision and Language Lab, KAIST. Correspondence to: Se Jin Park <EMAIL>, Julian Salazar <EMAIL>.
Pseudocode: No. The paper describes the architecture and methodology using text and diagrams (Figure 3, Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/. We do not currently release model weights.
Open Datasets: Yes. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/. ... We release examples of read- and extemporaneous-style generations of up to 16 minutes in length,1 and we release the LibriSpeech-Long evaluation dataset2 under a CC-BY 4.0 license. ... 2https://github.com/google-deepmind/librispeech-long/. ... We train on the unlab-60k split from Libri-Light (Kahn et al., 2020).
Dataset Splits: Yes. Table 1. Statistics of our proposed LibriSpeech-Long benchmark, which was generated with a maximum target duration of 4 minutes.

Subset       # Hours  # Examples  Avg. Dur. (s)  # Chapters  # Spkrs
dev-clean    16.0     295         194.8          97          40
dev-other    9.5      188         182.4          91          33
test-clean   14.8     270         197.6          87          40
  >3.5min    12.6     193         234.2          82          40
test-other   10.7     207         185.9          90          33
  >3.5min    8.2      126         234.4          77          32

... For win-rates, to mitigate length bias we only consider the 193 examples >3.5min (71% of test-clean). For MOS computations, we randomly selected 50 from these.
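The quoted Table 1 statistics can be cross-checked for internal consistency: for each subset, total hours should roughly equal the number of examples times the average duration. A minimal sketch (figures copied from the table above; the 0.2-hour tolerance accounts for one-decimal rounding):

```python
# Consistency check on the LibriSpeech-Long statistics quoted above.
subsets = {
    "dev-clean":  {"hours": 16.0, "examples": 295, "avg_dur_s": 194.8},
    "dev-other":  {"hours": 9.5,  "examples": 188, "avg_dur_s": 182.4},
    "test-clean": {"hours": 14.8, "examples": 270, "avg_dur_s": 197.6},
    "test-other": {"hours": 10.7, "examples": 207, "avg_dur_s": 185.9},
}

for name, s in subsets.items():
    implied_hours = s["examples"] * s["avg_dur_s"] / 3600
    assert abs(implied_hours - s["hours"]) < 0.2, name

# The win-rate subset: 193 of 270 test-clean examples is the quoted 71%.
assert round(193 / 270 * 100) == 71
```

All four subsets agree within rounding, which supports the split statistics as reported.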
Hardware Specification: Yes. Each model is trained with 16 TPUs (v5p) and data parallelism for 100k steps with 768k tokens per batch... On the TPU v5e, the 2B SpeechSSM decodes 16.4k tokens (roughly 10.9 minutes) in just over 100 seconds, a real-time factor well under 0.2x.
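The quoted decoding speed can be verified arithmetically, assuming the 25 Hz semantic-token frame rate stated in the experiment setup:

```python
# Back-of-the-envelope check of the quoted TPU v5e decoding figures.
frame_rate_hz = 25          # USM-v2 semantic-token frame rate
tokens = 16_400             # "16.4k tokens"
wall_clock_s = 100          # "just over 100 seconds"

audio_s = tokens / frame_rate_hz   # 656 s, i.e. roughly 10.9 minutes
rtf = wall_clock_s / audio_s       # wall-clock seconds per second of audio

assert abs(audio_s / 60 - 10.9) < 0.1
assert rtf < 0.2                   # ~0.15, "well under 0.2x" as claimed
```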
Software Dependencies: No. The paper mentions specific models like "SentenceBERT MiniLM-L6-v2" and "Gecko" for embeddings, and "wav2vec2-base-960h" for ASR, but does not provide specific version numbers for the underlying software libraries or frameworks (e.g., Python, PyTorch, TensorFlow) used to implement and run their models.
Experiment Setup: Yes. The default for SpeechSSM is 4min (240s) during training, though we compare with target durations of 30s and 16min (960s) as well (with 30s/16min segments). We train 2B and 9B variants, corresponding to RecurrentGemma. Each model is trained with 16 TPUs (v5p) and data parallelism for 100k steps with 768k tokens per batch, and a checkpoint is chosen via transcript perplexity on LibriSpeech-Long dev-clean... We sample semantic (USM-v2) tokens with temperature 1. Then the SoundStorm model (speaker-prompted with the first 3 seconds of the prompt) and windowing (30s with 4s overlaps) give acoustic (SoundStream) tokens... SpeechSSM-2B w/ 30s, 4min, and 16min segments get 750, 5760, and 24k tokens per segment respectively, under USM-v2's frame rate of 25Hz. ... Each version is trained on groups of 512, 128, and 32 sequences respectively, amounting to 768k tokens per batch. We train with the Adam optimizer with a learning rate of 5e-4 and weight decay 0.6 for 100k steps (with a warmup of 1k steps and a cosine decay schedule to 1/20th the learning rate).
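The quoted schedule (1k-step warmup, then cosine decay to 1/20th of the peak learning rate over 100k steps) could be sketched as below. This is an assumed implementation: only the peak, warmup length, total steps, and final ratio are stated in the paper, so the linear warmup and exact cosine form are illustrative choices.

```python
import math

PEAK_LR = 5e-4               # stated peak learning rate
FINAL_LR = PEAK_LR / 20      # "cosine decay schedule to 1/20th the learning rate"
WARMUP_STEPS = 1_000
TOTAL_STEPS = 100_000

def learning_rate(step: int) -> float:
    """Assumed linear warmup followed by cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

assert learning_rate(WARMUP_STEPS) == PEAK_LR          # peak after warmup
assert abs(learning_rate(TOTAL_STEPS) - FINAL_LR) < 1e-9  # ends at lr/20
```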