Speech Watermarking with Discrete Intermediate Representations

Authors: Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, Zhou Zhao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our framework achieves state-of-the-art performance in robustness and imperceptibility, simultaneously. Moreover, our flexible frame-wise approach can serve as an efficient solution for both voice cloning detection and information hiding. Additionally, DiscreteWM can encode 1 to 150 bits of watermark information within a 1-second speech clip, indicating its encoding capacity.
Researcher Affiliation | Academia | Zhejiang University EMAIL
Pseudocode | No | The paper describes methods using prose and mathematical formulations, but no clearly labeled 'Pseudocode' or 'Algorithm' blocks are present.
Open Source Code | No | Demo https://Discrete WM.github.io/discrete wm. This link points to a demonstration page, not explicitly to the source code repository for the methodology described in the paper.
Open Datasets | Yes | Datasets. For training, we employ the standard training set of LibriTTS (Zen et al. 2019), which contains approximately 585 hours of English speech at 24 kHz sampling rate.
Dataset Splits | Yes | For training, we employ the standard training set of LibriTTS (Zen et al. 2019)... We randomly select 100 text transcriptions and 100 speech prompts from the LibriTTS test-clean set... The test set also includes all of the speech samples from the test-clean set of LibriTTS.
Hardware Specification | Yes | The RTF (Real-Time Factor) evaluation is conducted with 1 NVIDIA A100 GPU and batch size 1.
Software Dependencies | No | The paper mentions software components and techniques like STFT, VQ-VAE, and GANs, but it does not specify any version numbers for libraries or tools used for implementation.
Experiment Setup | Yes | For the Short-Time Fourier Transform operation (STFT), we adopt a filter length of 400, a hop length of 80, and a window function applied to each frame with a length of 400... λadv is the hyper-parameter to balance the three terms, which is set to 10 2... We set the watermark ratio m of DiscreteWM to 10%.
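
The STFT settings quoted in the Experiment Setup row (filter length 400, hop length 80, window length 400) fix how many frames and frequency bins a clip produces. The sketch below works through that arithmetic. It is an illustration, not the authors' code: the function name `stft_geometry` is hypothetical, the 24 kHz rate is borrowed from the dataset description and may not be the rate the model actually operates at, and the frame count assumes no padding at the clip edges (many STFT implementations pad by default and would report a few more frames).

```python
def stft_geometry(num_samples, sample_rate_hz=24_000,
                  filter_length=400, hop_length=80, win_length=400):
    """Derive frame and bin counts for an unpadded STFT.

    Assumptions (not confirmed by the source): audio is processed at
    sample_rate_hz, and no padding is applied at the clip boundaries.
    """
    if num_samples < win_length:
        raise ValueError("clip is shorter than one analysis window")
    return {
        # Number of full windows that fit when sliding by hop_length samples.
        "num_frames": 1 + (num_samples - win_length) // hop_length,
        # A real-valued FFT of size filter_length keeps filter_length // 2 + 1 bins.
        "freq_bins": filter_length // 2 + 1,
        # Nominal frame rate, ignoring edge effects.
        "frames_per_second": sample_rate_hz / hop_length,
        # Hop and window durations in milliseconds.
        "hop_ms": 1000 * hop_length / sample_rate_hz,
        "window_ms": 1000 * win_length / sample_rate_hz,
    }


# A 1-second clip at the assumed 24 kHz rate.
geometry = stft_geometry(num_samples=24_000)
print(geometry)
```

Passing a different `sample_rate_hz` shows how the same filter and hop lengths translate into different window and hop durations if the audio is resampled before the STFT.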