SING: Spatial Context in Large Language Model for Next-Gen Wearables

Authors: Ayushi Mishra, Yang Bai, Priyadarshan Narayanasamy, Nakul Garg, Nirupam Roy

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of 25.72°, a substantial improvement over the 88.52° median error of existing work, with a word error rate (WER) of 5.3%. SING also supports soundscaping, for example, inferring how many people are talking and from which directions, with up to 5 speakers and a median DoA error of 16°. The system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
Researcher Affiliation | Academia | Department of Computer Science, University of Maryland, College Park, USA. Correspondence to: Nirupam Roy <EMAIL>.
Pseudocode | No | The paper describes steps in regular paragraph text without structured formatting, and contains mathematical equations but no clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code, or a direct link to a code repository, for the methodology it describes.
Open Datasets | Yes | For this study, we utilized the LibriSpeech dataset (Panayotov et al., 2015), a publicly available corpus of high-quality English speech sampled at 16 kHz. The dataset provides phonetically diverse recordings, a large vocabulary, and clean and noisy subsets, making it ideal for speech recognition and spatial analysis under different acoustic conditions (Words, 2025). For the application of spatial ASR, we generated a comprehensive 400-hour spatial speech dataset: we picked 500 original samples from LibriSpeech and convolved them with impulse responses from 1° to 360° at 1° resolution. For the application of soundscaping, leveraging LibriSpeech, we generated a comprehensive 2,000-hour spatial speech dataset that simulates scenarios involving 1 to 5 speakers speaking simultaneously.
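The spatialization step described above (convolving each clean LibriSpeech clip with a direction-specific impulse response for every 1° DoA) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the impulse-response array, its tap length, and the toy signals are all assumptions.

```python
import numpy as np

def spatialize(speech: np.ndarray, impulse_responses: np.ndarray) -> dict:
    """Convolve one mono speech clip with per-direction impulse responses.

    speech: 1-D float array, e.g. a 16 kHz utterance.
    impulse_responses: shape (360, ir_len), one IR per 1-degree DoA.
    Returns a dict mapping DoA in degrees (1..360) to the convolved waveform.
    """
    spatial = {}
    for doa in range(impulse_responses.shape[0]):
        # "full" convolution; a real pipeline would trim/normalize the output
        spatial[doa + 1] = np.convolve(speech, impulse_responses[doa])
    return spatial

# Toy example: 0.1 s of noise standing in for speech, random 64-tap IRs
rng = np.random.default_rng(0)
speech = rng.standard_normal(1600)
irs = rng.standard_normal((360, 64))
dataset = spatialize(speech, irs)
```

Repeating this over 500 source utterances yields one spatialized copy per utterance per degree, which is how a few hundred clips can expand into a 400-hour corpus.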
Dataset Splits | No | The paper describes the creation of synthetic datasets and their characteristics, such as generating 400-hour and 2,000-hour datasets and distributing the number of speakers evenly with uniform coverage of DoA angles. However, it does not explicitly provide the training, validation, or test splits (e.g., percentages or sample counts) needed to reproduce the experiment.
Hardware Specification | Yes | The encoder is trained on an A100 GPU, with each epoch taking approximately 20 minutes. The training is completed on 3 H100 80 GB GPUs.
Software Dependencies | No | The paper mentions software components and models such as OpenAI's Whisper model and the LLaMA 3.2 3B model, and refers to the Hugging Face SFT trainer documentation, but it does not provide specific version numbers for these or other key software dependencies required for replication.
Experiment Setup | Yes | SING presents specific hyperparameters for the DoA encoder, LLM pretraining, and fine-tuning in Table 3, which lists values such as "batch size", "num epochs", "learning rate", "loss function", "optimizer", "LoRA rank (r)", "LoRA alpha", and "LoRA dropout" for the DoA Encoder, Num-of-speaker Encoder, LLM Pretraining, and LLM Fine-Tuning.
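The LoRA hyperparameters named above map directly onto a Hugging Face PEFT configuration. The fragment below is a sketch of that mapping only: the numeric values and target modules are illustrative placeholders, not the settings reported in the paper's Table 3.

```python
from peft import LoraConfig, TaskType

# Illustrative LoRA configuration; r / alpha / dropout values are
# placeholders -- the paper's actual numbers are in its Table 3.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # "LoRA rank (r)"
    lora_alpha=32,      # "LoRA alpha"
    lora_dropout=0.05,  # "LoRA dropout"
    target_modules=["q_proj", "v_proj"],  # typical attention projections
)
```

Such a config, together with the batch size, epoch count, learning rate, and optimizer choices from Table 3, would then be handed to a trainer such as TRL's SFTTrainer, per the Hugging Face SFT trainer documentation the paper references.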