Online Detection of LLM-Generated Texts via Sequential Hypothesis Testing by Betting

Authors: Can Chen, Jun-Kun Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments were conducted to demonstrate the effectiveness of our method. ... We evaluate the effectiveness of our method through comprehensive experiments. The code and datasets are available via this link: https://github.com/canchen-cc/online-llm-detection. ... 4. Experiments ... Figure 2 shows the performance of our algorithm with different score functions under Scenario 1 (oracle). Figure 3 shows the empirical results of our algorithm under Scenario 2.
Researcher Affiliation | Academia | 1Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, USA; 2Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, USA. Correspondence to: Can Chen <EMAIL>, Jun-Kun Wang <EMAIL>.
Pseudocode | Yes | Algorithm 1: Online Detection of LLMs via Online Optimization and Betting ... Algorithm 2: Online Newton Step (Hazan et al., 2016) ... Algorithm 3: Online Detection of LLMs via Online Optimization and Betting for the Composite Hypotheses Testing.
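The algorithms are only named above, not reproduced. As orientation, a minimal hypothetical sketch of the general testing-by-betting pattern (wealth updated multiplicatively from bounded payoffs, bet fraction tuned by an Online Newton Step, null rejected once wealth reaches 1/α) could look like the following; the payoff construction, clipping range, and the exact update constants are illustrative assumptions, not the authors' Algorithm 1:

```python
import numpy as np

def betting_test(payoffs, alpha=0.05):
    """Sequential test by betting (hypothetical sketch, not the paper's code).

    payoffs: sequence of payoffs g_t in [-1, 1] that have mean zero under
    the null hypothesis. Wealth evolves as W_t = W_{t-1} * (1 + b_t * g_t),
    the bet b_t is updated by an Online Newton Step (ONS) on the losses
    -log(1 + b * g_t), and the null is rejected once W_t >= 1/alpha, which
    controls the type-I error by Ville's inequality.
    """
    wealth, bet, a_sum = 1.0, 0.0, 1.0
    c = 2.0 / (2.0 - np.log(3.0))  # ONS constant common in the betting literature
    for g in payoffs:
        wealth *= 1.0 + bet * g
        if wealth >= 1.0 / alpha:
            return True, wealth            # reject the null
        z = -g / (1.0 + bet * g)           # gradient of -log(1 + b*g) at b = bet
        a_sum += z * z
        bet -= c * z / a_sum               # ONS update of the bet fraction
        bet = min(max(bet, -0.5), 0.5)     # clip so wealth stays positive
    return False, wealth                   # budget exhausted, fail to reject
```

Under the null (payoffs hovering around zero) the wealth stays near 1 and the test rarely rejects; under the alternative, consistently positive payoffs compound the wealth exponentially, so detection time scales with the gap between the two score distributions.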
Open Source Code | Yes | We evaluate the effectiveness of our method through comprehensive experiments. The code and datasets are available via this link: https://github.com/canchen-cc/online-llm-detection.
Open Datasets | Yes | Specifically, we collect 500 news articles about the Paris 2024 Olympic Games from its official website (Olympics, 2024) ... We sample human-written text x_t from a pool of 500 news articles from the XSum dataset (Narayan et al., 2018). We emphasize that we also consider existing datasets from Bao et al. (2023) for the experiments.
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits. It describes a sequential detection methodology where texts are observed in a streaming fashion, and pre-trained score functions are used. It mentions using "the first 10 samples from each sequence of x_t and y_t" for parameter estimation, but this is not a standard dataset split for training/testing a model developed in this paper.
Hardware Specification | No | The paper mentions "Google Cloud Credits" in the Acknowledgements section, but does not provide specific details on the hardware used, such as GPU models, CPU types, or other processor specifications for running experiments.
Software Dependencies | No | The paper mentions specific language models and scoring models used (e.g., GPT-Neo-2.7B, Gemma-2B, T5-3B, GPT-J-6B) but does not list general software dependencies with their specific version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the step size γ, we simply follow the related works (Cutkosky & Orabona, 2018; Chugg et al., 2023; Shekhar & Ramdas, 2023) and let 1/γ = 2/(2 − ln 3). We consider two scenarios of sequential hypothesis testing in the experiments. ... we use the first 10 samples from each sequence of x_t and y_t and set d_t to be a constant, which is twice the value of max_{s ≤ 10} |ϕ(x_s) − ϕ(y_s)|. For estimating ϵ, we obtain scores for 20 texts sampled from the XSum dataset and randomly divide them into two groups, and set ϵ to twice the average absolute difference between the empirical means of these two groups across 1000 random shuffles. ... we repeat 1000 runs and report the average results over these 1000 runs. ... The parameter value d_t in Scenario 1 (oracle) is shown in Table 2, and the value for ϵ can be found in Table 1 in the appendix. ... Our method and the baselines require specifying the significance level parameter α. In our experiments, we try 20 evenly spaced values of α ranging from 0.005 to 0.1 and report the performance of each one. ... the time budget is T = 500 ... Batch sizes k ∈ {25, 50, 100, 250, 500} are considered for the baselines.
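The ϵ and d_t estimation procedures quoted above are mechanical enough to sketch directly. Assuming the inputs are plain lists of scores ϕ(x_s), ϕ(y_s) (the function names and I/O shapes here are illustrative, not the authors' code), a hypothetical rendering is:

```python
import random

def estimate_dt(scores_x, scores_y):
    """d_t as a constant: twice the max of |phi(x_s) - phi(y_s)| over the
    first 10 samples of each sequence (hypothetical sketch)."""
    return 2.0 * max(abs(a - b) for a, b in zip(scores_x[:10], scores_y[:10]))

def estimate_epsilon(scores, n_shuffles=1000, seed=0):
    """epsilon estimate: shuffle the 20 held-out XSum scores into two equal
    groups, take |mean(group 1) - mean(group 2)|, average over the shuffles,
    and double the result (hypothetical sketch)."""
    rng = random.Random(seed)
    scores = list(scores)
    half = len(scores) // 2
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(scores)
        m1 = sum(scores[:half]) / half
        m2 = sum(scores[half:]) / (len(scores) - half)
        total += abs(m1 - m2)
    return 2.0 * total / n_shuffles
```

Both estimates are deliberately conservative (the factor of two widens the tolerance/bound), which matches the report's description of setting the parameters from a small held-out prefix rather than tuning them on the test stream.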