XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
Authors: Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | XAttnMark achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. Furthermore, when tested in a zero-shot manner on unseen generative editing transformations, it is the only evaluated method that maintains non-trivial detection under strong generative edits. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Lehigh University, Bethlehem, PA, USA 2Dolby Laboratories Inc., San Francisco, CA, USA. |
| Pseudocode | No | The paper describes the methodology using prose and mathematical formulations within the main text and appendix, but it does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We train the models on a mixed audio dataset of 4100 hours containing speech (3016 hours VoxPopuli (Wang et al., 2021) and 100 hours LibriSpeech (Panayotov et al., 2015)), music (9 hours MusicCaps (Agostinelli et al., 2023) and 880 hours Free Music Archive (Defferrard et al., 2016)), and sound effects (98 hours AudioSet (Gemmeke et al., 2017)). |
| Dataset Splits | Yes | For evaluation, we use a held-out test set from MusicCaps of size 100. For each audio file, we embed 100 distinct messages, resulting in 10k unique watermarked audio samples. Each of these samples is then subjected to 16 different audio transformations, leading to a total of 160k evaluated instances. ... To improve the efficiency of transformation sampling, we update the sampling probability of each transformation every 1000 steps on the validation set, adjusting it based on the validation accuracy of each transformation. |
| Hardware Specification | No | This work is also supported by the Delta/Delta AI systems at NCSA and the Bridges-2 system at PSC through allocation CIS240308 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296. |
| Software Dependencies | No | We use the Adam optimizer (Kingma & Ba, 2015) ... and Exponential Moving Average (EMA) (Tarvainen & Valpola, 2017) ... The watermark generator consists of a waveform encoder and decoder, both utilizing components from EnCodec (Défossez et al., 2023). |
| Experiment Setup | Yes | Following prior works (San Roman et al., 2024; Chen et al., 2023), we use a sampling rate of 16 kHz and one-second mono samples for training (T = 16000) under 16 diverse audio editing transformations. ... The loss weights are set as: λTF = 1, λadv = 1, λℓ1 = 0.1, λmsspec = 2, λdetect = λmessage = 10. We use the Adam optimizer (Kingma & Ba, 2015) with learning rate 1e-5, β1 = 0.4, β2 = 0.9, and Exponential Moving Average (EMA) (Tarvainen & Valpola, 2017) with decay factor of 0.99 updated at every step. We train for 73k steps with batch size 16 and latent size H = 32. |
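
The evaluation protocol quoted in the Dataset Splits row is a simple combinatorial product. A minimal sketch of the bookkeeping (the counts are taken from the quote; the function name is hypothetical):

```python
# Evaluation-set sizes quoted in the paper's evaluation protocol:
# 100 held-out MusicCaps files x 100 distinct messages x 16 transformations.
NUM_FILES = 100
NUM_MESSAGES = 100
NUM_TRANSFORMS = 16

def count_eval_instances(num_files: int, num_messages: int,
                         num_transforms: int) -> int:
    """Each file is watermarked with every message, and each watermarked
    sample is passed through every transformation once."""
    watermarked = num_files * num_messages   # 10k unique watermarked samples
    return watermarked * num_transforms      # one instance per transformation

print(count_eval_instances(NUM_FILES, NUM_MESSAGES, NUM_TRANSFORMS))  # 160000
```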
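
The paper states that transformation sampling probabilities are updated every 1000 steps from per-transformation validation accuracy, but the quoted excerpt gives no formula. The sketch below is an assumption, not the authors' rule: it reweights so that harder (lower-accuracy) transformations are sampled more often, proportional to validation error.

```python
def update_sampling_probs(val_acc: dict[str, float],
                          eps: float = 1e-3) -> dict[str, float]:
    """Assumed reweighting rule (not specified in the excerpt): sample each
    transformation in proportion to its validation error, so transformations
    the detector handles poorly are seen more often during training.
    `eps` keeps every transformation's probability strictly positive."""
    weights = {name: (1.0 - acc) + eps for name, acc in val_acc.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical validation accuracies for three transformations:
probs = update_sampling_probs({"mp3": 0.95, "noise": 0.80, "gen_edit": 0.50})
```

Under this assumption, the generative-edit transformation (lowest accuracy) receives the largest sampling probability at the next refresh.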
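
The training configuration in the Experiment Setup row can be captured in a few lines. The weight values mirror the λ terms from the quote, and the EMA class is a generic implementation of the Tarvainen & Valpola update with the quoted decay of 0.99; the dictionary keys and class interface are hypothetical, and the surrounding training loop is omitted.

```python
# Loss weights quoted from the paper's experiment setup.
LOSS_WEIGHTS = {
    "tf": 1.0,        # λ_TF: time-frequency loss
    "adv": 1.0,       # λ_adv: adversarial loss
    "l1": 0.1,        # λ_ℓ1: waveform L1 loss
    "msspec": 2.0,    # λ_msspec: multi-scale spectrogram loss
    "detect": 10.0,   # λ_detect: detection loss
    "message": 10.0,  # λ_message: message-decoding loss
}

def total_loss(terms: dict[str, float]) -> float:
    """Weighted sum of the individual loss terms."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())

class EMA:
    """Exponential moving average of model parameters
    (Tarvainen & Valpola, 2017), decay 0.99, updated every step."""
    def __init__(self, params: list[float], decay: float = 0.99):
        self.decay = decay
        self.shadow = list(params)

    def update(self, params: list[float]) -> None:
        # shadow <- decay * shadow + (1 - decay) * current parameters
        self.shadow = [self.decay * s + (1.0 - self.decay) * p
                       for s, p in zip(self.shadow, params)]
```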