Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding
Authors: Dianwen Ng, Kun Zhou, Yi-Wen Chao, Zhiwei Xiong, Bin Ma, Engsiong Chng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on diverse datasets (LibriTTS, IEMOCAP, GTZAN, BBC) demonstrate MUFFIN's ability to consistently surpass existing performance in audio reconstruction across various domains. Notably, a high-compression variant achieves state-of-the-art performance at a 12.5 Hz frame rate while preserving reconstruction quality. |
| Researcher Affiliation | Collaboration | (1) Miro Mind, Singapore; (2) College of Computing & Data Science, Nanyang Technological University, Singapore; (3) Tongyi Speech Lab, Alibaba Group, Singapore. |
| Pseudocode | No | The paper describes the methodology in prose, including mathematical formulations for MBS-RVQ and architecture details, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Speech demos and code are available. Demos: https://demos46.github.io/muffin/ — Code: https://github.com/dianwen-ng/MUFFIN |
| Open Datasets | Yes | We train our model on a modest collection of 1,600 hours of speech, music, and environmental sounds. For speech, we use the LibriTTS (Zen et al., 2019) and EARS (Richter et al., 2024) datasets with expressive anechoic recordings of speech (585 and 100 hours, respectively). For music, we utilize Music4All (Santana et al., 2020) (910 hours). For environmental sounds, we use ESC50 (Piczak, 2015) (3 hours, 50 classes with 40 examples per class, loosely arranged into 5 major categories: animal, human, natural sounds, interior, and exterior sounds). |
| Dataset Splits | Yes | We use the LibriTTS evaluation dataset, with 4,837 samples for test-clean and 5,120 for test-other. ... For each batch, 1-second audio segments are randomly selected from each instance and are zero-padded if shorter than 1 second. ... For compatibility with the ASR model, trained on 16 kHz audio, we use the LibriSpeech test-clean dataset (Panayotov et al., 2015), consisting of 2,620 samples that are resampled during reconstruction. ... For each speaker in the test set, one speech sample from the same speaker is randomly selected as the prompt, ensuring it is distinct from the speech to be synthesized. |
| Hardware Specification | Yes | In our experiments, models were trained on two A800 GPUs for 300K iterations with a learning rate of 2e-4 and a batch size of 20 per GPU. ... MUFFIN, operating at 12.5 Hz, is trained on 2-second audio segments and requires four A800 GPUs to support a batch size of 10 per GPU. |
| Software Dependencies | No | The paper mentions specific models like 'Whisper-large V3 model' and 'VALL-E', and a tool for MACs computation, but does not provide specific version numbers for libraries or software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | In our experiments, models were trained on two A800 GPUs for 300K iterations with a learning rate of 2e-4 and a batch size of 20 per GPU. All quantizers utilize a 9-bit code lookup from the EMA codebook. ... The overall loss objectives include a multiscale mel spectrogram reconstruction loss, calculated as the L1 distance between predicted and target mel-spectrograms over multiple time scales (i.e., a 64-bin mel-spectrogram derived from an STFT, with a window size of 2^i and a hop length of 2^i/4 for i = 7, 8, 9, 10, 11). ... We use three discriminators: a multiscale STFT discriminator (MS-STFT), a multi-period discriminator (MPD), and a multiscale discriminator (MSD) to enhance perceptual quality through adversarial learning. We adopt the Hinge GAN adversarial loss formulation and L1 feature matching loss. ... The commitment loss is defined as L_commit = Σ_{i=1}^{K} β_i ‖z(B_i) − z_q(B_i)‖²₂. |
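The batching procedure quoted in the Dataset Splits row (random 1-second segments, zero-padded when the clip is shorter) can be sketched in NumPy. The function name, the 24 kHz default sample rate, and the uniform start-point sampling are assumptions for illustration, not details confirmed by the paper:

```python
import numpy as np

def crop_or_pad(audio, sample_rate=24000, seconds=1.0, rng=None):
    """Randomly select a fixed-length segment from a waveform;
    zero-pad on the right if the clip is shorter than the target."""
    rng = rng if rng is not None else np.random.default_rng()
    target = int(sample_rate * seconds)
    if len(audio) <= target:
        # Clip is too short: pad with zeros up to the target length.
        return np.pad(audio, (0, target - len(audio)))
    # Clip is long enough: pick a random window of `target` samples.
    start = int(rng.integers(0, len(audio) - target + 1))
    return audio[start:start + target]
```

Applied per instance in a batch, this yields equal-length tensors regardless of the source clip durations.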
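The commitment loss quoted in the Experiment Setup row sums a weighted squared distance between encoder outputs and their quantized codes across K frequency bands. A minimal NumPy sketch follows; the per-band weights β_i are taken as given, and the stop-gradient on the quantized codes (implicit in the paper's VQ training) is omitted since NumPy has no autograd:

```python
import numpy as np

def commitment_loss(z_bands, zq_bands, betas):
    """L_commit = sum_i beta_i * ||z(B_i) - z_q(B_i)||_2^2 over K bands.

    z_bands  : list of K arrays, encoder latents per frequency band
    zq_bands : list of K arrays, quantized codebook vectors per band
    betas    : list of K scalar band weights
    """
    return sum(beta * np.sum((z - zq) ** 2)
               for z, zq, beta in zip(z_bands, zq_bands, betas))
```

This term pulls each band's encoder output toward its assigned codebook entry, with β_i controlling how strongly each band is constrained.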