reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Watermark for Order-Agnostic Language Models

Authors: Ruibo Chen, Yihan Wu, Yanshuo Chen, Chenxi Liu, Junfeng Guo, Heng Huang

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our extensive evaluations on order-agnostic LMs, such as Protein MPNN and CMLM, demonstrate PATTERN-MARK's enhanced detection efficiency, generation quality, and robustness, positioning it as a superior watermarking technique for order-agnostic LMs. Through comprehensive experiments on two popular order-agnostic LMs, Protein MPNN (Dauparas et al., 2022) and CMLM (Ghazvininejad et al., 2019), we demonstrate the superiority of PATTERN-MARK in terms of detection efficiency, generation quality, and robustness compared to baseline methods. Our experimental section consists of three parts. In the first part, we compare the detection efficiency of PATTERN-MARK with the baseline. In the second part, we evaluate the generation quality of PATTERN-MARK. In the third part, we assess the robustness of the PATTERN-MARK when subjected to random token modification and paraphrasing attacks.
Researcher Affiliation	Academia	Department of Computer Science, University of Maryland, College Park, MD, USA
Pseudocode	Yes	Algorithm 1: PATTERN-MARK generator; Algorithm 2: PATTERN-MARK detector; Algorithm 3: Compute pattern occurrence probability under the null hypothesis; Algorithm 4: Compute pattern occurrence probability under the null hypothesis
Open Source Code	No	The paper does not explicitly provide an unambiguous statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository.
Open Datasets	Yes	For CMLM, we use the data collected from news crawl (footnote 1: https://data.statmt.org/news-crawl/) news.2015.ro.shuffled whose length is larger than 128 to encourage longer generation. The filtered dataset has 1003 samples. For Protein MPNN, we use the protein features from RCSB Protein Data Bank (footnote 2: https://www.rcsb.org/), which is published from 2020 Jan. 1st to 2023 Dec. 31st. We limit the number of polymer residues per deposited model to between 400 and 500. The filtered dataset has 747 samples.
Dataset Splits	No	The paper mentions the total number of samples used for evaluation (e.g., "around 800 generated protein sequences" and "around 1000 generated sequences") but does not provide specific train/validation/test splits for these or any underlying datasets for reproducing the data partitioning for their experiments.
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies	No	The paper mentions models like Protein MPNN (Dauparas et al., 2022) and CMLM (Ghazvininejad et al., 2019) and a specific checkpoint "v48_020 model checkpoint for Protein MPNN". However, it does not list specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks) required to replicate their experimental setup for PATTERN-MARK itself.
Experiment Setup	Yes	We select the key set $K = \{k_1, k_2\}$, the Markov-chain transition matrix $A = [[0, 1], [1, 0]]$, and the initial distribution $Q = [0.5, 0.5]$. The key patterns are defined as $T = \{k_1 k_2 k_1 \ldots, k_2 k_1 k_2 \ldots\}$, where $k_1$ and $k_2$ appear alternately, $T \subset K^m$. Under this configuration, the probability $P_{T,n}$ can be calculated using Algorithm 4, which optimizes the process described in Algorithm 3. We select $\delta \in \{0.5, 0.75, 1.0, 1.25, 1.5\}$ for protein generation task, and $\delta \in \{1, 2, 3, 4, 5\}$ for machine translation task.
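
The key-sequence configuration quoted in the Experiment Setup row can be illustrated with a short Python sketch. It is not the authors' code and does not implement Algorithm 3 or Algorithm 4. It only samples a key sequence from a Markov chain with the stated A and Q and counts windows that match the alternating patterns. The function names, the use of indices 0 and 1 to stand for k1 and k2, the pattern length m = 3, and the sequence length 8 are all assumptions made for illustration.

```python
import random

def sample_key_sequence(n, A, Q, rng):
    """Sample n key indices from a Markov chain with initial
    distribution Q and transition matrix A (hypothetical helper)."""
    keys = []
    for _ in range(n):
        # First key is drawn from Q, later keys from the row of A
        # selected by the previous key.
        weights = Q if not keys else A[keys[-1]]
        keys.append(rng.choices(range(len(Q)), weights=weights, k=1)[0])
    return keys

def alternating_patterns(m):
    """Return the two length-m patterns in which keys 0 and 1 alternate."""
    return {tuple((start + i) % 2 for i in range(m)) for start in (0, 1)}

def count_pattern_occurrences(keys, patterns, m):
    """Count the length-m windows of `keys` that match any pattern."""
    return sum(
        tuple(keys[i:i + m]) in patterns
        for i in range(len(keys) - m + 1)
    )

# Values taken from the Experiment Setup row.
A = [[0, 1], [1, 0]]
Q = [0.5, 0.5]

# Assumed values, chosen only to keep the example small.
m = 3
n = 8

rng = random.Random(0)
keys = sample_key_sequence(n, A, Q, rng)
patterns = alternating_patterns(m)
hits = count_pattern_occurrences(keys, patterns, m)
print(keys, hits)
```

Because A swaps the two keys with probability 1, the only randomness is the first key drawn from Q, and every length-m window of the sampled sequence matches one of the two patterns. A detector would compare such a count against what is expected without a watermark, which is the role the paper assigns to the probability computed by Algorithms 3 and 4.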