Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

Authors: Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan, Yuan Huang, Zhizheng Wu, Mingbo Ma

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Solely self-supervised trained on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. ... Section 4 EXPERIMENTS
Researcher Affiliation | Collaboration | Xueyao Zhang1 ... Zhizheng Wu1 ... 1The Chinese University of Hong Kong, Shenzhen ... Xiaohui Zhang2 ... Mingbo Ma2 ... 2Meta AI
Pseudocode | No | The paper describes its methodology in prose and mathematical equations. It includes architectural diagrams in figures but does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper references various open-source tools and checkpoints used for baselines or components (e.g., RepCodec, LLaMA, WavLM, BigVGAN, Amphion), but it does not contain an explicit statement from the authors about releasing the source code for their own proposed Vevo framework.
Open Datasets | Yes | We train the English-only models on 60K hours of ASR-transcribed English audiobooks, which is the same as the dataset used by the Voicebox English model [27]... For noisy data, which may include in-the-wild recordings and diverse recording devices, we use the Common Voice English dataset (CV) [68]. ... For comparison, we also examine the HuBERT-ASR-Large model, which is fine-tuned from HuBERT-Large for the ASR task on LibriSpeech [75].
Dataset Splits | Yes | We train the English-only models on 60K hours of ASR-transcribed English audiobooks... Both the content and content-style tokenizers are trained on a 100-hour subset randomly sampled from the full 60K-hour dataset. ... There are 700 evaluation samples in total: 200 from AB, 200 from CV, 150 from ACCENT, and 150 from EMOTION.
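The reported evaluation-set composition can be restated as a quick consistency check; the dictionary below is an illustrative sketch, not code from the paper, and the split names simply mirror the abbreviations used in the quote.

```python
# Evaluation-set sizes as reported in the paper (illustrative restatement).
eval_splits = {"AB": 200, "CV": 200, "ACCENT": 150, "EMOTION": 150}

# The per-split counts should sum to the stated 700 total.
total = sum(eval_splits.values())
print(total)  # 700
```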
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions several software components and tools (e.g., RepCodec, LLaMA, WavLM, BigVGAN, AdamW, Adam) and provides links to their repositories or models. However, it does not specify exact version numbers for these software libraries or dependencies, which would be necessary for full reproducibility.
Experiment Setup | Yes | The architecture of our AR transformer is similar to LLaMA [35]. It has 12 layers, 16 attention heads, 2048/3072 embedding/feed-forward network (FFN) dimension. ... During training, we use the AdamW [80] optimizer with a peak learning rate of 1e-4, linearly warmed up for 2K steps and decayed over the rest of training. It is trained for 500K updates. During inference, we generate evaluation samples with specific sampling parameters: top-k is 25, top-p is 0.9, and temperature is 0.8. ... The transformer has 24 layers, 16 attention heads, 1024/4096 embedding/feed-forward network (FFN) dimension. ... We use the Adam [81] optimizer with a peak learning rate of 1e-4, linearly warmed up for 5K steps and decayed over the rest of training. It is trained for 500K updates.
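The decoding parameters quoted above (top-k 25, top-p 0.9, temperature 0.8) can be sketched as a standard filtered-sampling step. This is a minimal NumPy illustration of how those three knobs interact, assuming the usual definitions of top-k and nucleus (top-p) filtering; the function name and implementation are assumptions, not the authors' code.

```python
import numpy as np

def sample_next_token(logits, top_k=25, top_p=0.9, temperature=0.8, rng=None):
    """Sample one token id using temperature, top-k, then top-p filtering.

    Illustrative sketch of the paper's reported decoding settings;
    not the authors' implementation.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: discard everything below the k-th largest logit.
    if top_k < len(logits):
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth, logits, -np.inf)

    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()

    return int(rng.choice(len(probs), p=filtered))

# With one strongly dominant logit and a tight nucleus, the dominant
# token is always selected.
tok = sample_next_token([10.0, 0.0, 0.0, 0.0], top_k=4, top_p=0.5,
                        rng=np.random.default_rng(0))
print(tok)  # 0
```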