WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Authors: Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Zhou Zhao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibits performance competitive with, or superior to, SOTA models across various objective and subjective metrics. We also evaluate WavTokenizer on semantic representation, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer.
Researcher Affiliation Collaboration Zhejiang University & Alibaba Group & Fundamental AI Research (FAIR), Meta
Pseudocode No The paper describes the model architecture and training process in detail but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code.
Open Source Code Yes Code and Checkpoint: https://github.com/jishengpeng/WavTokenizer
Open Datasets Yes For the speech domain, we use LibriTTS (Zen et al., 2019), VCTK (Veaux et al., 2016), and a subset of Common Voice (Ardila et al., 2019) (3,000 hours randomly selected). For the audio domain, we utilize a subset of AudioSet (Gemmeke et al., 2017) (2,000 hours randomly selected); and for the music domain, we employ the Jamendo (Bogdanov et al., 2019) and MUSDB (Rafii et al., 2017) datasets.
Dataset Splits Yes We evaluate the speech reconstruction performance of the codec in clean and noisy environments using the LibriTTS test-clean and test-other sets respectively, and assess audio and music reconstruction performance using the AudioSet eval and MUSDB test sets respectively. For most confirmatory experiments, such as the ablations, we evaluate results with a WavTokenizer trained only on LibriTTS. We uniformly truncate excessively long segments in the training data to a fixed length of 10 seconds and subsequently perform a random crop of the waveform to obtain 3-second audio snippets to be fed into WavTokenizer.
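The truncate-then-crop preprocessing quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the use of plain Python lists for waveforms are assumptions; only the 24 kHz rate, 10-second truncation, and 3-second crop come from the paper.

```python
import random

SAMPLE_RATE = 24_000   # paper: all audio resampled to 24 kHz
MAX_SECONDS = 10       # paper: long segments truncated to 10 s
CROP_SECONDS = 3       # paper: random 3 s crop fed to the model

def truncate_and_crop(wav, sample_rate=SAMPLE_RATE):
    """Illustrative helper: truncate to MAX_SECONDS, then random-crop CROP_SECONDS."""
    wav = wav[: MAX_SECONDS * sample_rate]            # drop samples past 10 s
    crop_len = CROP_SECONDS * sample_rate
    if len(wav) <= crop_len:                          # clip already short enough
        return wav
    start = random.randint(0, len(wav) - crop_len)    # uniform random offset
    return wav[start : start + crop_len]
```

For example, a 12-second clip first becomes 10 seconds (240,000 samples) and then yields a 72,000-sample snippet.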
Hardware Specification Yes We train WavTokenizer for up to 2 million iterations, with 1 million iterations allocated to training the generator and the discriminator respectively, on 8 NVIDIA A800 80G GPUs.
Software Dependencies No The paper mentions using the AdamW optimizer but does not specify any programming languages, libraries, or frameworks with their respective version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes We train WavTokenizer for up to 2 million iterations, with 1 million iterations allocated to training the generator and the discriminator respectively, on 8 NVIDIA A800 80G GPUs. Throughout training, all input speech, music, and audio samples are resampled to 24 kHz, and the batch size is 40. We uniformly truncate excessively long segments in the training data to a fixed length of 10 seconds and subsequently perform a random crop of the waveform to obtain 3-second audio snippets to be fed into WavTokenizer. WavTokenizer is optimized using the AdamW optimizer with an initial learning rate of 2e-4 and betas set to (0.9, 0.999). The learning rate is decayed on a cosine schedule.
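The paper states only that the 2e-4 learning rate is decayed on a cosine schedule; it does not give the exact implementation. A single-cycle cosine decay consistent with that description might look like the sketch below, where the floor learning rate of 0 and the absence of warmup are assumptions:

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-4, min_lr=0.0):
    """Single-cycle cosine-annealed learning rate (min_lr and no-warmup are assumed)."""
    progress = min(step / total_steps, 1.0)           # fraction of training done
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under this sketch the rate starts at 2e-4, passes 1e-4 at the halfway point, and reaches the floor at the final step.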