GaussMark: A Practical Approach for Structural Watermarking of Language Models

Authors: Adam Block, Alexander Rakhlin, Ayush Sekhari

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide formal statistical bounds on the validity and power of our procedure and, through an extensive suite of experiments, demonstrate that GaussMark is reliable, efficient, relatively robust to corruption, and can be instantiated with essentially no loss in model quality. In Section 4, we describe key empirical results providing a comprehensive evaluation of GaussMark on a variety of modern LMs, demonstrating its detectability and lack of effect on model quality.
Researcher Affiliation | Collaboration | 1 Microsoft Research, New York, USA; 2 Department of Computer Science, Columbia University, New York, USA; 3 Department of Computer Science, MIT, Cambridge, MA, USA; 4 Boston University, MA, USA.
Pseudocode | Yes | Algorithm 1 GaussMark.Generate ... Algorithm 2 GaussMark.Detect
Open Source Code | No | The paper mentions using the 'Hugging Face repository (Wolf et al., 2020a) to load our models, vLLM (Kwon et al., 2023) for generation, and PyTorch (Paszke et al., 2019) for watermark detection.' These are third-party tools used in the experiments, not a release of the authors' own GaussMark implementation.
Open Datasets | Yes | As has become standard in recent empirical evaluations of watermarking (Kirchenbauer et al., 2023a; Kuditipudi et al., 2023; Lau et al., 2024; Pan et al., 2024), we use the realnewslike split of the C4 dataset (Raffel et al., 2020) as prompts for generation. To evaluate the effect of GaussMark on text quality, we employ three benchmarks: SuperGLUE (Wang et al., 2019), GSM8K (Cobbe et al., 2021), and AlpacaEval 2.0 (Dubois et al., 2024; Li et al., 2023b).
Dataset Splits | Yes | We use the realnewslike split of the C4 dataset (Raffel et al., 2020) as prompts for generation. We use the same 1K prompts for all models and all watermarking keys in order to make the comparison fair.
Hardware Specification | Yes | All of our experiments are run on 40GB NVIDIA A100 GPUs, with each model chosen sufficiently small that a single GPU suffices for both generation and detection.
Software Dependencies | No | The paper mentions using the 'Hugging Face repository (Wolf et al., 2020a)', 'vLLM (Kwon et al., 2023)', and 'PyTorch (Paszke et al., 2019)'. However, it does not specify version numbers for these software dependencies, only citing the papers that introduced them.
Experiment Setup | Yes | Key Experimental Details. The GaussMark procedure described in Algorithms 1 and 2 can be used essentially out of the box, requiring careful selection of two key hyperparameters: the variance of the Gaussian noise, σ, and the specific parameter θ. ... In all of our experiments, we used 3 seeds for watermarking the model and generating tokens. The watermarking parameters used to construct the plots in the main body of the paper are summarized in Table 2. ... For each of the language models under consideration, we choose θ to be a single MLP layer (cf. Appendix C).
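The two hyperparameters above (noise scale σ and watermarked parameter θ) can be illustrated with a toy simulation. The sketch below is NOT the paper's exact algorithm or statistic; it shows the general flavor of a GaussMark-style scheme on a hypothetical unigram "language model" (a logit vector standing in for an MLP layer): watermark by adding Gaussian noise ξ to the weights, then detect by correlating ξ with the log-likelihood gradient of the observed text, normalized so the score is approximately N(0, 1) when the text is independent of ξ. All model and sample-size choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, sigma = 500, 20_000, 0.2  # vocab size, tokens observed, noise scale (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-in for the model weights theta (the paper uses a single MLP layer).
theta = rng.normal(size=V)

# Generate step: perturb the chosen weights with Gaussian noise xi ~ N(0, sigma^2 I).
xi = sigma * rng.normal(size=V)
theta_wm = theta + xi

def sample_counts(logits, n):
    """Draw n tokens from the unigram model; counts suffice for the gradient."""
    return rng.multinomial(n, softmax(logits))

def detect(counts, theta, xi, sigma):
    """Normalized correlation of xi with the log-likelihood gradient at the
    un-noised weights; approximately N(0,1) for text independent of xi."""
    g = counts - counts.sum() * softmax(theta)  # grad of log-likelihood wrt logits
    return xi @ g / (sigma * np.linalg.norm(g))

t_wm = detect(sample_counts(theta_wm, n), theta, xi, sigma)  # watermarked text
t_null = detect(sample_counts(theta, n), theta, xi, sigma)   # unwatermarked text
print(f"watermarked score: {t_wm:.1f}, null score: {t_null:.1f}")
```

With these illustrative settings the watermarked score concentrates around σ√n (far out in the tail of the null N(0, 1)), which mirrors the paper's point that larger σ or longer texts give higher power, while σ must stay small enough not to degrade model quality.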