TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
Authors: Wenxiang Guo, Yu Zhang, Changhao Pan, Rongjie Huang, Li Tang, Ruiqi Li, Zhiqing Hong, Yongqi Wang, Zhou Zhao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthesized singing voices, outperforming existing methods in audio quality and technique-specific control. |
| Researcher Affiliation | Academia | Zhejiang University |
| Pseudocode | Yes | For the pseudo-code of the algorithm, please refer to Algorithm 1 and Algorithm 2 provided in Appendix B.1. |
| Open Source Code | Yes | Code: https://github.com/gwx314/TechSinger |
| Open Datasets | Yes | We use the GTSinger dataset (Zhang et al. 2024c), focusing on its Chinese, English, Spanish, German, and French subsets. Additionally, we collect and annotate a 30-hour Chinese dataset with two singers and four technique annotations (e.g., intensity, mixed-falsetto, breathy, bubble) at the phone and sentence levels. Additionally, to further expand the dataset, we use a trained technique predictor and glissando judgment rule to annotate the M4Singer dataset at the phoneme level, which is used under the CC BY-NC-SA 4.0 license. |
| Dataset Splits | Yes | Finally, we randomly select 804 segments covering different singers and techniques as a test set. |
| Hardware Specification | Yes | In the first stage, training is performed for 200k steps on an NVIDIA 2080 Ti GPU, and in the second stage, for 120k steps. |
| Software Dependencies | No | The paper mentions several tools and frameworks used (e.g., pypinyin, the ARPA standard, Montreal Forced Aligner, RMVPE, HiFi-GAN vocoder), but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | In this experiment, the F0 and Mel vector field estimators use 100 steps. Their architectures are based on the non-causal WaveNet architecture (van den Oord et al. 2016). The technique detector uses 2 Squeezeformer layers and the technique predictor uses 2 Transformer layers. In the first stage, training is performed for 200k steps on an NVIDIA 2080 Ti GPU, and in the second stage, for 120k steps. We train the technique detector and predictor for 120k and 80k steps. Further details are provided in Appendix B.2. |
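The vector field estimators noted in the setup row are trained with a flow-matching objective, per the paper's title. The snippet below is a minimal NumPy sketch of one conditional flow-matching training step on the straight-line (rectified) probability path; it is not the paper's implementation, and the toy zero-velocity estimator and 80-bin mel dimension are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_step(x1, estimator):
    """One conditional flow-matching step on the straight-line path:
    interpolate x_t = (1-t)*x0 + t*x1 and regress the estimator's
    prediction onto the target velocity u = x1 - x0."""
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))  # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1            # point on the interpolation path
    target = x1 - x0                        # constant velocity along the path
    pred = estimator(xt, t)                 # v_theta(x_t, t)
    return np.mean((pred - target) ** 2)    # flow-matching MSE objective

# Toy estimator that always predicts zero velocity, standing in for the
# paper's WaveNet-based vector field estimator (placeholder only).
zero_estimator = lambda xt, t: np.zeros_like(xt)

batch = rng.standard_normal((8, 80))        # e.g. 8 frames of an 80-bin mel
loss = cfm_training_step(batch, zero_estimator)
```

At inference, a trained estimator would be integrated over the stated number of steps (100 here) from noise to a mel-spectrogram sample.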