PaLD: Detection of Text Partially Written by Large Language Models

Authors: Eric Lei, Hsiang Hsu, Chun-Fu Chen

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "Experimentally, we demonstrate the effectiveness of PaLD compared to baseline methods that build on existing LLM text detectors. In Section 4, we empirically illustrate that PaLD-PE and PaLD-TI outperform existing detection methods on two language datasets: Writing Prompts (Fan et al., 2018) and Yelp Reviews (Yelp, 2014)."
Researcher Affiliation | Collaboration | "Eric Lei1,2, Hsiang Hsu2, Chun-Fu (Richard) Chen2; 1University of Pennsylvania, 2JPMorgan Chase Global Technology Applied Research; EMAIL, EMAIL"
Pseudocode | Yes | Algorithm 1 (greedy algorithm, PaLD-TI):
    Initialize S = {arg max_{e in {1,...,n}} f_x({e})}
    Initialize A = {1,...,n} \ S
    while f_x(S) increases do
        e* = arg max_{e in A} f_x(S ∪ {e}) − f_x(S)
        S ← S ∪ {e*}
        A ← A \ {e*}
    end while
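The greedy loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score_fn` is a stand-in for the paper's set function f_x over segment indices, and the toy objective in the usage example is invented for demonstration.

```python
def greedy_pald_ti(n, score_fn):
    """Greedily grow a set S of segment indices while score_fn(S) improves.

    n        -- number of text segments
    score_fn -- callable mapping a set of indices to a scalar score
                (a stand-in for the paper's f_x)
    """
    # Seed S with the single best element, as in "Initialize S = {arg max ...}".
    best = max(range(n), key=lambda e: score_fn({e}))
    S = {best}
    A = set(range(n)) - S
    current = score_fn(S)
    while A:
        # Candidate with the largest marginal gain over the current score.
        e = max(A, key=lambda c: score_fn(S | {c}))
        gain = score_fn(S | {e}) - current
        if gain <= 0:
            break  # f_x(S) no longer increases; stop as in the while condition
        S |= {e}
        A -= {e}
        current += gain
    return S

# Toy usage: reward overlap with a hidden "LLM-written" set, penalize size.
target = {1, 3, 4}
sel = greedy_pald_ti(6, lambda S: len(S & target) - 0.1 * len(S))
```

With this toy objective the marginal gain of every element outside `target` is negative, so the loop stops once `target` is recovered.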
Open Source Code | Yes | "Code to reproduce our experiments can be accessed at https://github.com/jpmorganchase/pald."
Open Datasets | Yes | "We evaluate our methods on the Writing Prompts (WP) (Fan et al., 2018) and Yelp Reviews (Yelp) (Yelp, 2014) datasets, which are typically used to benchmark LLM text detection."
Dataset Splits | Yes | "In total, for each dataset, we generate 3,600 and 300 mixed texts for the training and test splits, respectively. For the training split, the LLM target fractions range from 0.1 to 0.9 in steps of 0.1; for the test split, they are set to 0.25, 0.5, and 0.75. The amount of data at each fraction is similar, yielding a balanced dataset."
Hardware Specification | Yes | "On a single A10 GPU, the exact solver takes 30s for a 10-segment text on average, whereas greedy takes 2.1s."
Software Dependencies | No | The paper mentions software such as RoBERTa, GPT-4o, and Claude-3.5-Sonnet, but does not provide specific version numbers for these or any other libraries or frameworks used in the implementation.
Experiment Setup | Yes | "We use Logit Norm with temperature τ = 0.005 and train the RoBERTa model on the training split of the respective dataset. For the posterior, we choose the prior P(δ) to be the Beta(2, 2) distribution. During inference, we draw 5,000 samples, discarding the first 1,000 as burn-in, using Metropolis-Hastings (Gelman et al., 2004) with a truncated-normal proposal centered at the previous sample and truncated to [0, 1]."
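The posterior-sampling step can be sketched as below, assuming δ is the LLM fraction in [0, 1]. This is a hedged illustration, not the paper's code: `log_likelihood` is a stand-in for the detector-based likelihood, and the proposal width `sigma` is an assumed value. Because a normal proposal truncated to [0, 1] is asymmetric, the Hastings correction term is included.

```python
import math
import random

def beta22_log_prior(d):
    # Beta(2, 2) density is proportional to d * (1 - d) on (0, 1).
    return math.log(d) + math.log(1.0 - d)

def trunc_normal_sample(mu, sigma):
    # Rejection-sample N(mu, sigma^2) truncated to (0, 1).
    while True:
        x = random.gauss(mu, sigma)
        if 0.0 < x < 1.0:
            return x

def trunc_normal_logpdf(x, mu, sigma):
    # log N(x; mu, sigma) minus the log normalizing mass on [0, 1].
    z = (x - mu) / sigma
    log_phi = -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    cdf = lambda t: 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))
    return log_phi - math.log(cdf(1.0) - cdf(0.0))

def mh_sample(log_likelihood, n_samples=5000, burn_in=1000, sigma=0.1, seed=0):
    """Metropolis-Hastings over delta with Beta(2, 2) prior and a
    truncated-normal proposal centered at the previous sample."""
    random.seed(seed)
    d = 0.5  # assumed starting point
    chain = []
    for _ in range(n_samples):
        prop = trunc_normal_sample(d, sigma)
        # Log acceptance ratio, with the Hastings correction for the
        # asymmetric truncated proposal.
        log_alpha = (log_likelihood(prop) + beta22_log_prior(prop)
                     - log_likelihood(d) - beta22_log_prior(d)
                     + trunc_normal_logpdf(d, prop, sigma)
                     - trunc_normal_logpdf(prop, d, sigma))
        if log_alpha >= 0 or random.random() < math.exp(log_alpha):
            d = prop
        chain.append(d)
    return chain[burn_in:]  # discard the first `burn_in` samples

# Toy usage: a likelihood sharply peaked near delta = 0.75.
samples = mh_sample(lambda d: -200.0 * (d - 0.75) ** 2)
```

Drawing 5,000 samples and discarding the first 1,000 matches the setup described above; the resulting chain concentrates near the likelihood peak, slightly pulled toward 0.5 by the Beta(2, 2) prior.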