StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models

Authors: Ya Jiang, Chuxiong Wu, Massieh Kordi Boroujeny, Brian Mark, Kai Zeng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical evaluations across diverse tasks highlight the stealthiness, detectability, and resilience of StealthInk, establishing it as an effective solution for LLM watermarking applications. ... 6. Experiments: We compare StealthInk with SOTA methods (Yoo et al., 2024; Qu et al., 2024; Fernandez et al., 2023) on stealthiness, detectability, and robustness.
Researcher Affiliation | Academia | (1) Department of Computer Science, George Mason University, Fairfax, VA, USA; (2) Wireless Cyber Center, College of Engineering and Computing, George Mason University, Fairfax, VA, USA. Correspondence to: Ya Jiang <EMAIL>.
Pseudocode | Yes | Algorithm 1 shows the process of encoding a multi-bit watermark in StealthInk. ... Algorithm 2 in Appendix D.
Open Source Code | No | The paper describes a novel watermarking scheme called StealthInk and presents its methodology. However, it does not contain any explicit statement about releasing the source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | For text completion, unless noted otherwise, we use LLAMA2-7B (Touvron et al., 2023) and 500 randomly selected texts from the RealNewsLike subset of C4 (Raffel et al., 2020)... For the machine translation task, we focus on English-to-Romanian translation and employ the Multilingual BART (MBart) model (Liu et al., 2020) on the WMT 14 En-Ro corpus (Bojar et al., 2014)... For the text summarization task, we employ the BART-large model (Liu et al., 2020)... we use the test set from the CNN-DM corpus (Hermann et al., 2015).
Dataset Splits | Yes | For the machine translation task, we utilize the WMT 16 English (En) to Romanian (Ro) dataset, comprising 1,999 examples in the test set. ... In the text summarization task, we use the test set from the CNN-DM corpus (Hermann et al., 2015), consisting of 11,490 examples on BART-large (Liu et al., 2020).
Hardware Specification | Yes | All experiments are conducted on the Nvidia A100 GPU with 40 GB of memory.
Software Dependencies | No | The paper mentions models like LLAMA2-7B, BART-large, and MBart, and uses SHA-256 as a pseudorandom function. However, it does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow) used to implement the experiments.
Experiment Setup | Yes | For text completion, unless noted otherwise, we use LLAMA2-7B (Touvron et al., 2023) and 500 randomly selected texts from the RealNewsLike subset of C4 (Raffel et al., 2020), trimming a fixed number of tokens from the start as prompts (see Appendix H). ... The default temperature is 1.0 and the texture key length h is 3. The multinomial sampling strategy is applied during text generation. ... StealthInk achieves an AUC of 0.98 and a bit accuracy of 0.92 when embedding 24-bit messages in 300 tokens.
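Several rows above name concrete mechanisms from the paper's setup: SHA-256 as a pseudorandom function, a texture key of length h = 3, temperature-1.0 multinomial sampling, and bit accuracy as the decoding metric. The sketch below illustrates those generic building blocks only; the function names, key format, and seeding scheme are illustrative assumptions, not the paper's Algorithm 1.

```python
import hashlib
import math
import random

def prf_seed(prev_tokens, secret_key):
    """Derive a PRNG seed from the last h context tokens via SHA-256.

    Generic illustration of a hash-based texture key; the paper's exact
    construction may differ.
    """
    data = secret_key + b"|" + b",".join(str(t).encode() for t in prev_tokens)
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def sample_token(logits, prev_tokens, secret_key, temperature=1.0, h=3):
    """Temperature-scaled multinomial sampling seeded by the texture key."""
    rng = random.Random(prf_seed(prev_tokens[-h:], secret_key))
    # Softmax over temperature-scaled logits (max-subtracted for stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def bit_accuracy(decoded_bits, true_bits):
    """Fraction of correctly recovered message bits (the metric reported
    alongside AUC in the experiments row)."""
    assert len(decoded_bits) == len(true_bits)
    return sum(int(a == b) for a, b in zip(decoded_bits, true_bits)) / len(true_bits)
```

Because the sampler is re-seeded from the same hashed context, a detector holding the secret key can replay the pseudorandomness at each position; that determinism is what multi-bit decoding schemes of this kind rely on.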