Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding
Authors: Dianwen Ng, Kun Zhou, Yi-Wen Chao, Zhiwei Xiong, Bin Ma, Engsiong Chng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on diverse datasets (LibriTTS, IEMOCAP, GTZAN, BBC) demonstrate MUFFIN's ability to consistently surpass existing performance in audio reconstruction across various domains. Notably, a high-compression variant achieves state-of-the-art performance at a 12.5 Hz frame rate while preserving reconstruction quality. |
| Researcher Affiliation | Collaboration | (1) Miro Mind, Singapore; (2) College of Computing & Data Science, Nanyang Technological University, Singapore; (3) Tongyi Speech Lab, Alibaba Group, Singapore. |
| Pseudocode | No | The paper describes the methodology in prose, including mathematical formulations for MBS-RVQ and architecture details, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Speech demos and code are available. Demos: https://demos46.github.io/muffin/ — Code: https://github.com/dianwen-ng/MUFFIN |
| Open Datasets | Yes | We train our model on a modest collection of 1,600 hours of speech, music, and environmental sounds. For speech, we use the LibriTTS (Zen et al., 2019) and EARS (Richter et al., 2024) datasets with expressive anechoic recordings of speech (585 and 100 hours, respectively). For music, we utilize Music4All (Santana et al., 2020) (910 hours). For environmental sounds, we use ESC50 (Piczak, 2015) (3 hours, 50 classes with 40 examples per class, loosely arranged into 5 major categories: animal, human, natural sounds, interior, and exterior sounds). |
| Dataset Splits | Yes | We use the LibriTTS evaluation dataset, with 4,837 samples for test-clean and 5,120 for test-other. ... For each batch, 1-second audio segments are randomly selected from each instance and are zero-padded if shorter than 1 second. ... For compatibility with the ASR model, trained on 16 kHz audio, we use the LibriSpeech test-clean dataset (Panayotov et al., 2015), consisting of 2,620 samples that are resampled during reconstruction. ... For each speaker in the test set, one speech sample from the same speaker is randomly selected as the prompt, ensuring it is distinct from the speech to be synthesized. |
| Hardware Specification | Yes | In our experiments, models were trained on two A800 GPUs for 300K iterations with a learning rate of 2e-4 and a batch size of 20 per GPU. ... MUFFIN, operating at 12.5 Hz, is trained on 2-second audio segments and requires four A800 GPUs to support a batch size of 10 per GPU. |
| Software Dependencies | No | The paper mentions specific models like 'Whisper-large V3 model' and 'VALL-E', and a tool for MACs computation, but does not provide specific version numbers for libraries or software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | In our experiments, models were trained on two A800 GPUs for 300K iterations with a learning rate of 2e-4 and a batch size of 20 per GPU. All quantizers utilize a 9-bit code lookup from the EMA codebook. ... The overall loss objectives include a multiscale mel spectrogram reconstruction loss, calculated as the L1 distance between predicted and target mel-spectrograms over multiple time scales (i.e., a 64-bin mel-spectrogram derived from an STFT, with a window size of 2^i and a hop length of 2^i/4 for i = 7, 8, 9, 10, 11). ... We use three discriminators: a multiscale STFT discriminator (MS-STFT), a multi-period discriminator (MPD), and a multiscale discriminator (MSD) to enhance perceptual quality through adversarial learning. We adopt the Hinge GAN adversarial loss formulation and L1 feature matching loss. ... The commitment loss is defined as L_commit = Σ_{i=1}^{K} β_i ‖z(B_i) − z_q(B_i)‖²₂. |
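The batching procedure quoted in the Dataset Splits row (random 1-second segments, zero-padded when the clip is shorter) can be sketched in NumPy. The function name, the 24 kHz default sample rate, and the uniform start-point sampling are assumptions for illustration, not details confirmed by the paper:

```python
import numpy as np

def crop_or_pad(audio, sample_rate=24000, seconds=1.0, rng=None):
    """Randomly select a fixed-length segment from a waveform;
    zero-pad on the right if the clip is shorter than the target."""
    rng = rng if rng is not None else np.random.default_rng()
    target = int(sample_rate * seconds)
    if len(audio) <= target:
        # Clip is too short: pad with zeros up to the target length.
        return np.pad(audio, (0, target - len(audio)))
    # Clip is long enough: pick a random window of `target` samples.
    start = int(rng.integers(0, len(audio) - target + 1))
    return audio[start:start + target]
```

Applied per instance in a batch, this yields equal-length tensors regardless of the source clip durations.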
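The commitment loss quoted in the Experiment Setup row sums a weighted squared distance between encoder outputs and their quantized codes across K frequency bands. A minimal NumPy sketch follows; the per-band weights β_i are taken as given, and the stop-gradient on the quantized codes (implicit in the paper's VQ training) is omitted since NumPy has no autograd:

```python
import numpy as np

def commitment_loss(z_bands, zq_bands, betas):
    """L_commit = sum_i beta_i * ||z(B_i) - z_q(B_i)||_2^2 over K bands.

    z_bands  : list of K arrays, encoder latents per frequency band
    zq_bands : list of K arrays, quantized codebook vectors per band
    betas    : list of K scalar band weights
    """
    return sum(beta * np.sum((z - zq) ** 2)
               for z, zq, beta in zip(z_bands, zq_bands, betas))
```

This term pulls each band's encoder output toward its assigned codebook entry, with β_i controlling how strongly each band is constrained.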