Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers

Authors: Vaden Masrani, Mohammad Akbari, David Ming Xuan Yue, Ahmad Rezaei, Yong Zhang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the proposed passthrough layers on a wide range of downstream tasks, and show experimentally that our watermarking method achieves near-perfect watermark extraction accuracy and false-positive rates in most cases without damaging original model performance.
Researcher Affiliation Industry Huawei Technologies Canada Co. Ltd.
Pseudocode No The paper describes methods using mathematical equations and descriptive text, but no distinct pseudocode or algorithm blocks are explicitly provided or labeled.
Open Source Code Yes Code https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=58b799a0-5cfc-4c2e-8b9b440bb2315264
Open Datasets Yes Following (Gu et al. 2023), we validate our method across 4 classification tasks and 7 datasets: SST2 (Socher et al. 2013), IMDB (Maas et al. 2011), SNLI (Bowman et al. 2015), MNLI (Williams, Nangia, and Bowman 2018), AGNews (Zhang, Zhao, and LeCun 2015), News Group (NG) (Lang 1995), and PAWS (Zhang, Baldridge, and He 2019), covering sentiment classification, entailment, paraphrase detection, and topic classification tasks.
Dataset Splits No The paper mentions fine-tuning for a certain number of epochs and using a pruning ratio, but does not provide specific train/test/validation dataset splits (e.g., exact percentages or sample counts) in the main text. It states that "Hyperparameter settings for each stage and additional details about how metrics are calculated are given in the Appendix (Masrani et al. 2024)", suggesting these details are reported there.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only implies that computational resources were used by mentioning 'Wallclock times are reported in Table 1'.
Software Dependencies No The paper mentions using publicly available PLMs from Hugging Face and specific models such as BERT-base-uncased, GPT-2, and Llama2-7B. However, it does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or Hugging Face Transformers library versions).
Experiment Setup Yes We add passthrough layers at positions {3,5,8} (PTL-358) to the pretrained BERT, and train it for 10K steps. All layers except the passthrough layers, head, and last layer are frozen. [...] we use GPT-2 with 124M parameters. [...] We add passthrough layers at positions {1}, {1,4,7}, and {1,3,5,7,9}, and train for 100k steps on the Open Web Text. [...] We fine-tune BERT described in the Classification Tasks Section for 10 epochs over 5 downstream tasks. [...] with a pruning ratio of 50% [...] followed by a fine-tuning round for 1 epoch
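The frozen-backbone setup quoted above can be sketched with the Hugging Face `transformers` API. This is a minimal illustration, not the authors' code: using plain `BertLayer` blocks as passthrough layers, interpreting {3,5,8} as indices in the final stack, and treating `model.pooler` as a stand-in for the task head are all assumptions made for the sketch.

```python
# Hypothetical sketch of the setup above: insert extra transformer blocks
# ("passthrough layers") into a BERT encoder at positions {3, 5, 8} and
# freeze everything except those blocks, the last encoder layer, and the head.
# Using plain BertLayer modules here is an assumption; the paper's actual
# passthrough-layer architecture may differ.
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel
from transformers.models.bert.modeling_bert import BertLayer

def add_passthrough_layers(model: BertModel, positions):
    """Insert freshly initialized BertLayer blocks at the given final indices."""
    layers = list(model.encoder.layer)
    inserted = []
    for pos in sorted(positions):  # ascending insertion keeps final indices correct
        layer = BertLayer(model.config)
        layers.insert(pos, layer)
        inserted.append(layer)
    model.encoder.layer = nn.ModuleList(layers)
    model.config.num_hidden_layers = len(layers)
    return inserted

def freeze_all_except(model: nn.Module, trainable_modules):
    """Freeze every parameter, then re-enable gradients on selected modules."""
    for p in model.parameters():
        p.requires_grad = False
    for m in trainable_modules:
        for p in m.parameters():
            p.requires_grad = True

config = BertConfig()   # 12-layer BERT-base shape; random init keeps the sketch offline
model = BertModel(config)
inserted = add_passthrough_layers(model, {3, 5, 8})
# Trainable: passthrough layers, last encoder layer, and the pooler (head proxy).
freeze_all_except(model, inserted + [model.encoder.layer[-1], model.pooler])

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"encoder layers: {len(model.encoder.layer)}, trainable params: {trainable:,}")
```

The passthrough layers are the only new capacity in the network, so training for 10K steps updates the watermark-carrying blocks (plus the head and last layer) while the pretrained backbone stays fixed.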