Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

Authors: Taesoo Kim, Jinju Kim, Dong Chan Kim, Jong Hwan Ko, Gyeong-Moon Park

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. (Abstract). See also Section 5. Experiment, which details quantitative and qualitative evaluations, baseline comparisons, and dataset analyses.
Researcher Affiliation Collaboration (1) Department of Electrical and Computer Engineering, Sungkyunkwan University; (2) KT Corporation; (3) Visiting fellow at Carnegie Mellon University; (4) Department of Artificial Intelligence, Korea University. The authors are affiliated with universities (Sungkyunkwan University, Carnegie Mellon University, Korea University) and a corporation (KT Corporation), indicating a collaboration.
Pseudocode No The paper describes methods in prose, such as in Section 4.1 'Approach: Guided Unlearning' and Section 4.2 'Teacher-Guided Unlearning', without presenting structured pseudocode or algorithm blocks.
Open Source Code No The demo is available at https://speechunlearn.github.io/ (Footnote 1 on page 3). This provides access to a demo, but not explicitly to the source code for the methodology described in the paper.
Open Datasets Yes We utilize Libriheavy, an English speech corpus of 50,000 hours derived from Libri-Light (Kahn et al., 2020)... (Section 5.1). To evaluate the performance... we used the LibriSpeech test-clean set (Panayotov et al., 2015). (Section 5.1). For the experiment in Table 13, we randomly selected 1 speaker as forget set from the LibriTTS (Zen et al., 2019) corpus. (Section 5.1)
Dataset Splits Yes For each speaker, 5 minutes of speech audio were randomly chosen for the evaluation set, with the remaining data used for the training set. (Section 5.1). To create the forget set, 10 speakers were randomly selected from the dataset... For each selected speaker, approximately 300 seconds of audio was randomly chosen as the evaluation set, while the remaining audio was designated for the unlearning training set. (Appendix A.1)
Hardware Specification No The paper does not explicitly mention specific hardware details such as GPU models (e.g., NVIDIA A100, RTX series), CPU models, or specific cloud computing resources used for training or inference.
Software Dependencies No We utilized the torchdiffeq package (Chen, 2018), which offers both fixed and adaptive step ODE solvers, using the default midpoint solver. (Appendix A.6). While a package is mentioned, a specific version number is not provided.
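To make the solver choice above concrete: torchdiffeq's default `midpoint` method is a fixed-step explicit midpoint integrator. A minimal self-contained sketch of that scheme (plain Python, not the torchdiffeq implementation itself; the function and step count here are illustrative only):

```python
import math

def midpoint_solve(f, y0, t0, t1, n_steps):
    """Fixed-step explicit midpoint ODE solver for dy/dt = f(t, y)."""
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k = f(t, y)                                 # slope at interval start
        y = y + h * f(t + h / 2, y + (h / 2) * k)   # slope at the midpoint
        t += h
    return y

# Example: dy/dt = -y with y(0) = 1; exact solution is exp(-t).
y1 = midpoint_solve(lambda t, y: -y, 1.0, 0.0, 1.0, 100)
print(abs(y1 - math.exp(-1)))  # small: global error is O(h^2)
```

The midpoint rule is second-order accurate, which is why torchdiffeq offers it as a cheap fixed-step default alongside adaptive solvers.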
Experiment Setup Yes The duration predictor... is trained for 600K steps. The Adam optimizer was employed with a peak learning rate of 1e-4, linearly warmed up over the first 5K steps and decayed afterward. (Appendix A.4). We trained the original Voice model for 500K steps. Each mini-batch consisted of 75-second audio segments, and the Adam optimizer was employed with a peak learning rate of 1e-4, linearly warmed up over the first 5K steps and decayed afterward. (Appendix A.5). The TGU model was trained for 145K steps for 1 and 10K steps for 2... To facilitate the unlearning process, samples from the forget set xf were randomly selected with a 20% probability in each mini-batch. (Appendix B.1). where λ, a hyper-parameter that controls the weighting between the losses, is set to 0.2. (Section 4.2). where α is fixed at 0.7, as specified in the original paper. (Appendix A.6).