GaussMark: A Practical Approach for Structural Watermarking of Language Models

Authors: Adam Block, Alexander Rakhlin, Ayush Sekhari

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide formal statistical bounds on the validity and power of our procedure and, through an extensive suite of experiments, demonstrate that GaussMark is reliable, efficient, relatively robust to corruption, and can be instantiated with essentially no loss in model quality. In Section 4, we describe key empirical results providing a comprehensive evaluation of GaussMark on a variety of modern LMs, demonstrating its detectability and lack of effect on model quality.
Researcher Affiliation | Collaboration | 1 Microsoft Research, New York, USA; 2 Department of Computer Science, Columbia University, New York, USA; 3 Department of Computer Science, MIT, Cambridge, MA, USA; 4 Boston University, MA, USA.
Pseudocode | Yes | Algorithm 1 GaussMark.Generate ... Algorithm 2 GaussMark.Detect
Open Source Code | No | The paper mentions using the 'Hugging Face repository (Wolf et al., 2020a) to load our models, vLLM (Kwon et al., 2023) for generation, and PyTorch (Paszke et al., 2019) for watermark detection.' These are third-party tools used in the experiments, not a release of the authors' own GaussMark implementation.
Open Datasets | Yes | As has become standard in recent empirical evaluations of watermarking (Kirchenbauer et al., 2023a; Kuditipudi et al., 2023; Lau et al., 2024; Pan et al., 2024), we use the realnewslike split of the C4 dataset (Raffel et al., 2020) as prompts for generation. To evaluate the effect of GaussMark on text quality, we employ three benchmarks: SuperGLUE (Wang et al., 2019), GSM8K (Cobbe et al., 2021), and AlpacaEval 2.0 (Dubois et al., 2024; Li et al., 2023b).
Dataset Splits | Yes | We use the realnewslike split of the C4 dataset (Raffel et al., 2020) as prompts for generation. We use the same 1K prompts for all models and all watermarking keys in order to make the comparison fair.
Hardware Specification | Yes | All of our experiments are run on 40GB NVIDIA A100 GPUs, with each model chosen sufficiently small that a single GPU suffices for both generation and detection.
Software Dependencies | No | The paper mentions using the 'Hugging Face repository (Wolf et al., 2020a)', 'vLLM (Kwon et al., 2023)', and 'PyTorch (Paszke et al., 2019)'. However, it does not specify version numbers for these software dependencies, only citing the papers that introduced them.
Experiment Setup | Yes | Key Experimental Details. The GaussMark procedure described in Algorithms 1 and 2 can be used essentially out of the box, requiring careful selection of two key hyperparameters: the variance of the Gaussian noise, σ, and the specific parameter θ. ... In all of our experiments, we used 3 seeds for watermarking the model and generating tokens. The watermarking parameters used to construct the plots in the main body of the paper are summarized in Table 2. ... For each of the language models under consideration, we choose θ to be a single MLP layer (cf. Appendix C).
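The two hyperparameters above (noise scale σ and watermarked parameter θ) can be illustrated with a toy simulation. The sketch below is NOT the paper's exact algorithm or statistic; it shows the general flavor of a GaussMark-style scheme on a hypothetical unigram "language model" (a logit vector standing in for an MLP layer): watermark by adding Gaussian noise ξ to the weights, then detect by correlating ξ with the log-likelihood gradient of the observed text, normalized so the score is approximately N(0, 1) when the text is independent of ξ. All model and sample-size choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, sigma = 500, 20_000, 0.2  # vocab size, tokens observed, noise scale (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-in for the model weights theta (the paper uses a single MLP layer).
theta = rng.normal(size=V)

# Generate step: perturb the chosen weights with Gaussian noise xi ~ N(0, sigma^2 I).
xi = sigma * rng.normal(size=V)
theta_wm = theta + xi

def sample_counts(logits, n):
    """Draw n tokens from the unigram model; counts suffice for the gradient."""
    return rng.multinomial(n, softmax(logits))

def detect(counts, theta, xi, sigma):
    """Normalized correlation of xi with the log-likelihood gradient at the
    un-noised weights; approximately N(0,1) for text independent of xi."""
    g = counts - counts.sum() * softmax(theta)  # grad of log-likelihood wrt logits
    return xi @ g / (sigma * np.linalg.norm(g))

t_wm = detect(sample_counts(theta_wm, n), theta, xi, sigma)  # watermarked text
t_null = detect(sample_counts(theta, n), theta, xi, sigma)   # unwatermarked text
print(f"watermarked score: {t_wm:.1f}, null score: {t_null:.1f}")
```

With these illustrative settings the watermarked score concentrates around σ√n (far out in the tail of the null N(0, 1)), which mirrors the paper's point that larger σ or longer texts give higher power, while σ must stay small enough not to degrade model quality.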