SING: Spatial Context in Large Language Model for Next-Gen Wearables
Authors: Ayushi Mishra, Yang Bai, Priyadarshan Narayanasamy, Nakul Garg, Nirupam Roy
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of 25.72°, a substantial improvement over the 88.52° median error in existing work, with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inferring how many people were talking and their directions, with up to 5 people and a median DoA error of 16°. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Maryland, College Park, USA. Correspondence to: Nirupam Roy <EMAIL>. |
| Pseudocode | No | The paper describes steps in regular paragraph text without structured formatting, and contains mathematical equations but no clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing their own source code or a direct link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | For this study, we utilized the LibriSpeech dataset (Panayotov et al., 2015), a publicly available corpus of high-quality English speech, sampled at 16 kHz. The dataset provides phonetically diverse speech recordings, a large vocabulary, and clean and noisy subsets, making it ideal for speech recognition and spatial analysis under different acoustic conditions (Words, 2025). For the application of spatial ASR, we generated a comprehensive 400-hour spatial speech dataset. We picked 500 original samples from the LibriSpeech dataset and convolved them with the impulse responses from 1° to 360° at 1° resolution. For the application of soundscaping, leveraging LibriSpeech, we generated a comprehensive 2,000-hour spatial speech dataset that simulates scenarios involving 1 to 5 speakers speaking simultaneously. |
| Dataset Splits | No | The paper describes the creation of synthetic datasets and their characteristics, such as generating 400-hour and 2,000-hour datasets and distributing the number of speakers evenly with uniform coverage of DoA angles. However, it does not explicitly provide training, validation, or test dataset splits (e.g., percentages or sample counts) needed to reproduce the experiment. |
| Hardware Specification | Yes | The encoder is trained on an A100 GPU, with each epoch taking approximately 20 minutes. The training is completed on 3 H100 80 GB GPUs. |
| Software Dependencies | No | The paper mentions software components and models like 'OpenAI's Whisper model' and 'LLaMA 3.2 3B model' and refers to 'Hugging Face SFT trainer documentation', but it does not provide specific version numbers for these or other key software dependencies required for replication. |
| Experiment Setup | Yes | SING presents specific hyperparameters for the DoA encoder, LLM pretraining, and fine-tuning in Table 3. Table 3 lists numerous hyperparameters such as "batch size", "num epochs", "learning rate", "loss function", "optimizer", "LoRA rank (r)", "LoRA alpha", and "LoRA dropout" for the DoA Encoder, Num-of-speaker Encoder, LLM Pretraining, and LLM Fine-Tuning. |
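The spatial dataset construction quoted above (convolving clean LibriSpeech samples with direction-dependent impulse responses) can be sketched as follows. This is a minimal illustration, not the paper's released pipeline: the random "speech" and 512-tap impulse responses are synthetic stand-ins, and `spatialize` is a hypothetical helper name.

```python
import numpy as np

def spatialize(speech, impulse_responses):
    """Convolve a mono speech signal with one impulse response per
    microphone channel, producing a multi-channel spatial recording."""
    return np.stack([np.convolve(speech, ir) for ir in impulse_responses])

# Synthetic stand-ins: 1 s of noise as "speech" at 16 kHz, and two
# random 512-tap impulse responses (one per microphone channel).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
irs = rng.standard_normal((2, 512))

spatial = spatialize(speech, irs)
print(spatial.shape)  # (2, 16511): full convolution adds len(ir) - 1 samples
```

In the paper's setup, a bank of measured or simulated impulse responses (one per DoA from 1° to 360°) would replace the random arrays, with the chosen angle serving as the spatial label.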
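The LoRA hyperparameters named in Table 3 (rank, alpha, dropout) map directly onto a standard adapter configuration. A hedged sketch using Hugging Face's `peft` library is shown below; the numeric values are placeholders, not the paper's reported settings.

```python
# Config fragment only: illustrates where "LoRA rank (r)", "LoRA alpha",
# and "LoRA dropout" from Table 3 would plug in. Values are placeholders.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                 # LoRA rank (r)
    lora_alpha=16,       # LoRA alpha
    lora_dropout=0.05,   # LoRA dropout
    task_type="CAUSAL_LM",
)
```

Such a config would then be passed to the SFT trainer the paper references, alongside the batch size, epoch count, learning rate, and optimizer choices it lists.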