BiMark: Unbiased Multilayer Watermarking for Large Language Models
Authors: Xiaoyan Feng, He Zhang, Yanjun Zhang, Leo Yu Zhang, Shirui Pan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments to evaluate BiMark's effectiveness across three key dimensions: message embedding capacity, text quality preservation, and an ablation study of the multilayer mechanism. |
| Researcher Affiliation | Academia | 1 Griffith University, Brisbane; 2 RMIT University, Melbourne; 3 University of Technology Sydney, Sydney. Correspondence to: Leo Yu Zhang <EMAIL>, Shirui Pan <EMAIL>. |
| Pseudocode | Yes | C. Algorithms Alg. 1 summarizes the watermarked text generation process for message embedding discussed in Sec. 4.2. Alg. 2 summarizes the watermark detection process for message extraction discussed in Sec. 4.2. |
| Open Source Code | Yes | The code is available at: https://github.com/Kx-Feng/BiMark.git. |
| Open Datasets | Yes | C4-RealNewslike (Raffel et al., 2020) dataset is used as prompts. For text summarization, BART-large (Lewis et al., 2019) is employed on the CNN/Daily Mail dataset (See et al., 2017)... For machine translation, MBart (Lewis et al., 2019) is employed on the WMT16 En-Ro subset (Bojar et al., 2016)... |
| Dataset Splits | No | The paper mentions using datasets like C4-RealNewslike, CNN/Daily Mail, and WMT16 En-Ro for experiments. However, it does not provide specific details regarding train, validation, or test splits for these datasets within the context of its own experimental methodology or for reproducibility. |
| Hardware Specification | No | The paper discusses computational cost and efficiency, mentioning token generation times for different batch sizes. However, it does not explicitly specify the models or types of hardware (e.g., GPUs, CPUs) used to run the experiments. |
| Software Dependencies | No | The paper mentions several models and tools such as Llama3-8B, Gemma-9B, BERT, WordNet, BART-large, and MBart. However, it does not provide specific version numbers for any of these software components, libraries, or programming languages, which are necessary for replication. |
| Experiment Setup | Yes | For experiments of message embedding capacity, the Llama3-8B model (AI@Meta, 2024) is used for text generation with temperature 1.0 and top-50 sampling. For MPAC and Soft Red List, the proportion γ of green lists is 0.5 for a balance between detectability and text quality... For SynthID, the number of tournaments is 30, as recommended in the default setting. For BiMark, the base scaling factor δ is 1.0, and the number of layers d is 10... |
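To make the quoted baseline settings concrete, the following is a minimal sketch of the Soft Red List scheme (green-list logit biasing in the style of Kirchenbauer et al.) that the setup row parameterizes with γ = 0.5 and δ = 1.0. This is not the paper's code: the function names, the hash-based seeding, and the use of only the previous token as context are illustrative assumptions.

```python
import hashlib
import math
import random


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Deterministically partition the vocabulary based on the previous
    token: a fraction gamma of token ids form the 'green' list.
    (Illustrative seeding scheme, not the paper's implementation.)"""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def bias_logits(logits, prev_token: int, gamma: float = 0.5, delta: float = 1.0):
    """Add the scaling factor delta to the logits of green-list tokens,
    softly biasing generation toward the watermarked partition."""
    greens = green_list(prev_token, len(logits), gamma)
    return [v + delta if i in greens else v for i, v in enumerate(logits)]


def detect_z_score(tokens, vocab_size: int, gamma: float = 0.5) -> float:
    """Detection side: count green tokens and compare against the
    binomial null hypothesis via a one-proportion z-test."""
    hits = sum(
        1
        for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab_size, gamma)
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Under this reading, γ controls the green-list size (0.5 here, trading detectability against text quality, per the quoted setup) and δ controls how strongly sampling is pushed toward green tokens; detection needs only the seeding key, not the model.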