TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching

Authors: Wenxiang Guo, Yu Zhang, Changhao Pan, Rongjie Huang, Li Tang, Ruiqi Li, Zhiqing Hong, Yongqi Wang, Zhou Zhao

AAAI 2025

Reproducibility Variable (Result): LLM Response
Research Type (Experimental): Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthesized singing voices, outperforming existing methods in audio quality and technique-specific control. Experiments show that the model excels at generating high-quality, technique-controlled singing voices.
Researcher Affiliation (Academia): Zhejiang University.
Pseudocode (Yes): For the pseudocode of the algorithm, refer to Algorithm 1 and Algorithm 2 in Appendix B.1.
Open Source Code (Yes): https://github.com/gwx314/TechSinger
Open Datasets (Yes): The paper uses the GTSinger dataset (Zhang et al. 2024c), focusing on its Chinese, English, Spanish, German, and French subsets. The authors additionally collect and annotate a 30-hour Chinese dataset with two singers and four technique annotations (e.g., intensity, mixed-falsetto, breathy, bubble) at the phoneme and sentence levels. To further expand the data, a trained technique predictor and a glissando judgment rule are used to annotate the M4Singer dataset at the phoneme level; M4Singer is used under the CC BY-NC-SA 4.0 license.
Dataset Splits (Yes): 804 segments covering different singers and techniques are randomly selected as the test set.
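The random hold-out described above can be sketched as follows. Note that the paper only reports the test-set size (804 segments); the segment naming, corpus size, and random seed below are illustrative assumptions, not details from the paper.

```python
import random

def split_test_segments(segments, n_test=804, seed=0):
    """Randomly hold out n_test segments as a test set; the rest train.

    The 804-segment test size matches the paper; the seed and any
    singer/technique balancing strategy are assumptions.
    """
    rng = random.Random(seed)
    held_out = rng.sample(segments, n_test)
    test_set = set(held_out)
    train = [s for s in segments if s not in test_set]
    return train, held_out

# Hypothetical corpus of 5000 segment IDs.
segments = [f"seg_{i:04d}" for i in range(5000)]
train, test = split_test_segments(segments)
print(len(train), len(test))  # 4196 804
```

A real split would likely also stratify by singer and technique so the test set covers all of them, as the paper states; `random.sample` alone does not guarantee that coverage.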
Hardware Specification (Yes): First-stage training runs for 200k steps on an NVIDIA 2080 Ti GPU; the second stage runs for 120k steps.
Software Dependencies (No): The paper mentions several tools and frameworks (e.g., pypinyin, the ARPA phoneme standard, the Montreal Forced Aligner, RMVPE, and the HiFi-GAN vocoder) but provides no version numbers for any of these software dependencies.
Experiment Setup (Yes): The F0 and mel vector field estimators use 100 flow-matching sampling steps; their architectures are based on the non-causal WaveNet (van den Oord et al. 2016). The technique detector uses 2 Squeezeformer layers and the technique predictor 2 Transformer layers. First-stage training runs for 200k steps on an NVIDIA 2080 Ti GPU; the second stage runs for 120k steps. The technique detector and predictor are trained for 120k and 80k steps, respectively. Further details are provided in Appendix B.2.
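The 100 steps reported for the F0 and mel vector field estimators correspond to numerically integrating the learned flow from noise to data. A minimal sketch of that inference loop, assuming fixed-step Euler integration from t=0 to t=1: the real estimator is a trained non-causal WaveNet, which is replaced here by a toy closed-form field so the example runs standalone.

```python
import numpy as np

def euler_flow_sample(vector_field, x0, num_steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    num_steps=100 mirrors the step count reported for TechSinger's F0
    and mel vector field estimators; the Euler scheme itself is an
    assumption (any ODE solver could be used).
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x

# Toy stand-in for the trained estimator: a conditional-flow field
# whose trajectories move in a straight line from the noise sample x0
# toward a fixed target x1, i.e. v(x, t) = (x1 - x) / (1 - t).
x1 = np.ones(4)

def toy_field(x, t):
    return (x1 - x) / max(1.0 - t, 1e-3)  # clamp avoids division by zero at t=1

x0 = np.zeros(4)
out = euler_flow_sample(toy_field, x0, num_steps=100)
print(np.round(out, 3))  # reaches the target [1. 1. 1. 1.]
```

In the actual model this loop would run once for the F0 contour and once for the mel-spectrogram, each with its own WaveNet-based field estimator conditioned on the lyrics, notes, and technique labels.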