Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation
Authors: Ziqian Ning, Shuai Wang, Yuepeng Jiang, Jixun Yao, Lei He, Shifeng Pan, Jie Ding, Lei Xie
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that Freestyler produces high-quality rapping voice generation with enhanced naturalness and strong alignment with accompanying beats, both stylistically and rhythmically. Experiments conducted on our collected rap dataset show that Freestyler generates high-quality rap that fits the accompaniment. Experimental results from objective and subjective evaluations demonstrate the effectiveness of Freestyler. |
| Researcher Affiliation | Collaboration | Ziqian Ning1,2, Shuai Wang3, Yuepeng Jiang1, Jixun Yao1, Lei He2, Shifeng Pan2, Jie Ding2, Lei Xie1* 1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China 2 Microsoft, China 3 Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China |
| Pseudocode | No | The paper describes the methodology in text and uses diagrams (Figure 1, Figure 2, Figure 3) to illustrate the model architecture and data flow, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We present RapBank, a large volume rap dataset with comprehensive data-processing pipeline, suitable for model training. Both the data and the processing pipeline are publicly available on Hugging Face1 and Github2. 2https://github.com/NZqian/RapBank |
| Open Datasets | Yes | Due to the scarcity of publicly available rap datasets, we also present RapBank, a rap song dataset collected from the internet, alongside a meticulously designed processing pipeline. Given the lack of publicly available rap datasets, we collected a large volume of rap songs from the internet and designed a meticulous pipeline for data cleaning, processing, and filtering, resulting in a dataset we have named RapBank. Both the data and the processing pipeline are publicly available on Hugging Face1 and Github2. 1https://huggingface.co/datasets/zqning/RapBank |
| Dataset Splits | Yes | Dataset We utilize the English subset of RapBank to train the LM, which contains approximately 58,200 songs with a total duration of 3,800 hours. After processing, we get the Basic, Standard and Premium subsets containing 1,321, 295 and 58 hours of data respectively. We employ the entire RapBank to train the CFM model as it does not require any labels. We randomly reserved 200 samples for evaluation, with no singer overlapping with the training set. These samples are human-annotated to get the ground truth lyrics. |
| Hardware Specification | Yes | We train the LLaMA model using 4 NVIDIA V100 GPUs with a batch size of 16 and gradient accumulation of 4. The conditional flow matching model for semantic-to-spectrogram generation contains 129M parameters and is also trained using 4 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions several tools and models like LLaMA (Touvron et al. 2023), BigVGAN (Lee et al. 2023) V2, Wav2Vec XLS-R (Conneau et al. 2021), G2P phonemizer (Bernard and Titeux 2021), BS-RoFormer (Lu et al. 2024), WebRTC Voice Activity Detector, and Whisper (Radford et al. 2023). While specific versions are sometimes indicated for models (e.g., BigVGAN-V2), the paper does not list specific version numbers for general programming languages or common libraries (e.g., Python, PyTorch, CUDA) required for replication. |
| Experiment Setup | Yes | We build a 6-layer LLaMA (Touvron et al. 2023) for lyrics-to-semantic modeling, with 116M parameters. As mentioned earlier, to mitigate the train-inference mismatch of lengths in vocal-accompaniment pairs, a masking strategy is applied probabilistically: there is a 50% chance that the entire accompaniment condition will be masked, and for the other 50% chance, a mask will be applied to a random length of the latter half of the accompaniment. We first pre-train the LLaMA model on the Basic subset, followed by sequential supervised finetuning (SFT) on both the Standard and Premium subsets. We train the LLaMA model using 4 NVIDIA V100 GPUs with a batch size of 16 and gradient accumulation of 4. The conditional flow matching model for semantic-to-spectrogram generation contains 129M parameters and is also trained using 4 NVIDIA V100 GPUs. The batch size and gradient accumulation are 64 and 4, respectively. Each data segment is fixed at ten seconds in length, with shorter segments being padded and longer segments truncated. The number of sampling steps is set to 20. For audio restoration, we employ the pre-trained BigVGAN-V2 44.1 kHz version6. The number of K-means clusters is set to 1024, and the accompaniment feature shift K is set to 150 (3 secs). |
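
The accompaniment masking strategy quoted above can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the function name, the NumPy representation of the accompaniment features as a `(frames, dim)` array, and the use of zeroing as the mask operation are all assumptions made for illustration.

```python
import random

import numpy as np


def mask_accompaniment(acc: np.ndarray, p_full: float = 0.5,
                       rng: random.Random = random) -> np.ndarray:
    """Probabilistic accompaniment masking (illustrative sketch).

    With probability `p_full` (50% in the paper) the entire
    accompaniment condition is masked; otherwise a random-length
    span of the latter half is masked. `acc` has shape (frames, dim).
    """
    acc = acc.copy()  # leave the caller's array untouched
    n = acc.shape[0]
    if rng.random() < p_full:
        acc[:] = 0.0  # mask the whole accompaniment condition
    else:
        half = n // 2
        # mask a suffix of random length, confined to the latter half
        length = rng.randint(1, n - half)
        acc[n - length:] = 0.0
    return acc
```

Masking only a suffix of the latter half matches the stated motivation: at inference time the model must continue generating vocals past the portion of accompaniment it has been conditioned on, so training should sometimes hide the tail of the condition.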