BiMark: Unbiased Multilayer Watermarking for Large Language Models
Authors: Xiaoyan Feng, He Zhang, Yanjun Zhang, Leo Yu Zhang, Shirui Pan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments to evaluate BiMark's effectiveness across three key dimensions: message embedding capacity, text quality preservation, and an ablation study of the multilayer mechanism. |
| Researcher Affiliation | Academia | 1 Griffith University, Brisbane; 2 RMIT University, Melbourne; 3 University of Technology Sydney, Sydney. Correspondence to: Leo Yu Zhang <EMAIL>, Shirui Pan <EMAIL>. |
| Pseudocode | Yes | C. Algorithms Alg. 1 summarizes the watermarked text generation process for message embedding discussed in Sec. 4.2. Alg. 2 summarizes the watermark detection process for message extraction discussed in Sec. 4.2. |
| Open Source Code | Yes | The code is available at: https://github.com/Kx-Feng/BiMark.git. |
| Open Datasets | Yes | C4-RealNewslike (Raffel et al., 2020) dataset is used as prompts. For text summarization, BART-large (Lewis et al., 2019) is employed on the CNN/Daily Mail dataset (See et al., 2017)... For machine translation, MBart (Lewis et al., 2019) is employed on the WMT16 En-Ro subset (Bojar et al., 2016)... |
| Dataset Splits | No | The paper mentions using datasets like C4-RealNewslike, CNN/Daily Mail, and WMT16 En-Ro for experiments. However, it does not provide specific details regarding train, validation, or test splits for these datasets within the context of its own experimental methodology or for reproducibility. |
| Hardware Specification | No | The paper discusses computational cost and efficiency, mentioning token generation times for different batch sizes. However, it does not explicitly specify the models or types of hardware (e.g., GPUs, CPUs) used to run the experiments. |
| Software Dependencies | No | The paper mentions several models and tools such as Llama3-8B, Gemma-9B, BERT, WordNet, BART-large, and MBart. However, it does not provide specific version numbers for any of these software components, libraries, or programming languages, which are necessary for replication. |
| Experiment Setup | Yes | For experiments of message embedding capacity, the Llama3-8B model (AI@Meta, 2024) is used for text generation with temperature 1.0 and top-50 sampling. For MPAC and Soft Red List, the proportion γ of green lists is 0.5 for a balance between detectability and text quality... For SynthID, the number of tournaments is 30, as recommended in the default setting. For BiMark, the base scaling factor δ is 1.0, and the number of layers d is 10... |
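To make the quoted baseline settings concrete, the following is a minimal sketch of the Soft Red List scheme (green-list logit biasing in the style of Kirchenbauer et al.) that the setup row parameterizes with γ = 0.5 and δ = 1.0. This is not the paper's code: the function names, the hash-based seeding, and the use of only the previous token as context are illustrative assumptions.

```python
import hashlib
import math
import random


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Deterministically partition the vocabulary based on the previous
    token: a fraction gamma of token ids form the 'green' list.
    (Illustrative seeding scheme, not the paper's implementation.)"""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def bias_logits(logits, prev_token: int, gamma: float = 0.5, delta: float = 1.0):
    """Add the scaling factor delta to the logits of green-list tokens,
    softly biasing generation toward the watermarked partition."""
    greens = green_list(prev_token, len(logits), gamma)
    return [v + delta if i in greens else v for i, v in enumerate(logits)]


def detect_z_score(tokens, vocab_size: int, gamma: float = 0.5) -> float:
    """Detection side: count green tokens and compare against the
    binomial null hypothesis via a one-proportion z-test."""
    hits = sum(
        1
        for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab_size, gamma)
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Under this reading, γ controls the green-list size (0.5 here, trading detectability against text quality, per the quoted setup) and δ controls how strongly sampling is pushed toward green tokens; detection needs only the seeding key, not the model.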