Training-free LLM-generated Text Detection by Mining Token Probability Sequences

Authors: Yihuai Xu, Yongwei Wang, Yifei Bi, Huangsen Cao, Zhouhan Lin, Yu Zhao, Fei Wu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on six datasets involving cross-domain, cross-model, and cross-lingual detection scenarios, under both white-box and black-box settings, demonstrated that our method consistently achieves state-of-the-art performance.
Researcher Affiliation | Academia | Zhejiang University; Georgia Institute of Technology; Shanghai Jiao Tong University; Zhejiang Gongshang University
Pseudocode | No | The paper describes the methodology with mathematical formulations and a framework diagram (Figure 2) and outlines the detection process in three steps; however, it does not include a dedicated pseudocode block or formally stated algorithm.
Open Source Code | Yes | The code and data are released at https://github.com/TrustMedia-zju/Lastde_Detector.
Open Datasets | Yes | The experiments involved 6 distinct datasets covering a range of languages and topics. Adhering to the setups of Fast-DetectGPT and DNA-GPT, we report the main detection results on 4 datasets: XSum (Narayan et al., 2018) (BBC News documents), SQuAD (Rajpurkar et al., 2016; 2018) (Wikipedia-based Q&A context), WritingPrompts (Fan et al., 2018) (story generation), and Reddit ELI5 (Fan et al., 2019) (Q&A data restricted to the topics of biology, physics, chemistry, economics, law, and technique).
Dataset Splits | Yes | We prefer the latter approach and have fitted logistic regression models on datasets (including XSum, WritingPrompts, Reddit) generated by two closed-source models (GPT-4-Turbo, GPT-4o) and one open-source model (OPT-13B), reporting metrics on the test set (test size = 0.2).
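The split described above (fit a logistic regression on detector scores, evaluate on a 20% held-out test set) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the scores are synthetic stand-ins drawn from two Gaussians rather than actual Lastde statistics, and the regression is fitted with plain gradient descent to keep the example self-contained.

```python
import math
import random

random.seed(0)
# Synthetic 1-D detector scores: label 0 = human-written, 1 = LLM-generated.
data = [(random.gauss(-1.0, 1.0), 0) for _ in range(500)]
data += [(random.gauss(1.0, 1.0), 1) for _ in range(500)]
random.shuffle(data)

split = int(0.8 * len(data))  # hold out 20% for evaluation (test size = 0.2)
train, test = data[:split], data[split:]

# One-feature logistic regression fitted by batch gradient descent.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(300):
    gw = gb = 0.0
    for x, y in train:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(LLM-generated)
        gw += (p - y) * x
        gb += p - y
    w -= lr * gw / len(train)
    b -= lr * gb / len(train)

# Classify by the sign of the logit; report held-out accuracy.
correct = sum((w * x + b > 0.0) == (y == 1) for x, y in test)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In practice the 1-D feature would be the detector's score for each passage, and the fitted threshold replaces a hand-tuned decision boundary.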
Hardware Specification | Yes | Our experimental setup consists of two RTX 3090 GPUs (2 × 24 GB).
Software Dependencies | No | The paper lists the LLMs used as source and proxy models, with references to their technical reports or versions (e.g., GPT-4 (OpenAI, 2024b), Gemma (Team et al., 2024), GPT-J (Wang & Komatsuzaki, 2021)). However, it does not provide version numbers for general software dependencies such as the programming language (e.g., Python) or libraries (e.g., PyTorch, TensorFlow, Hugging Face Transformers) used to implement the methodology.
Experiment Setup | Yes | Furthermore, for Lastde, the 3 hyperparameters are set to default values of s = 3, ε = 10 n, τ = 5, where n is the number of tokens in the text. ... For Lastde++, the default settings are s = 4, ε = 8 n, τ = 15.