Deep Kernel Relative Test for Machine-generated Text Detection
Authors: Yiliao Song, Zhenqiao Yuan, Shuhai Zhang, Zhen Fang, Jun Yu, Feng Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the superior performance of our method, compared to state-of-the-art non-parametric and parametric detectors. |
| Researcher Affiliation | Academia | School of Computer and Mathematical Sciences, The University of Adelaide, Adelaide, AU; School of Computing and Information Systems, University of Melbourne, Melbourne, AU; School of Software Engineering, South China University of Technology, Guangzhou, CN; Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, AU; School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, CN |
| Pseudocode | Yes | Algorithm 1 Relative Test MGT Detection |
| Open Source Code | Yes | The code and demo are available: https://github.com/xLearn-AU/R-Detect. |
| Open Datasets | Yes | We design our experiments on data from five benchmarks: HC3 (Guo et al., 2023), TruthfulQA (TQA) (He et al., 2023; Lin et al., 2022), RAID (Dugan et al., 2024), and DetectRL (Wu et al., 2024). |
| Dataset Splits | Yes | In the default setting, we randomly take 512 tokens and repeat the experiments 10 × 10 times given a specific experimental design. During each round of detection in section 4.2 and section 4.3, we first shuffle the HC3 dataset and select the first 512 tokens from HWTs and the first 512 tokens from MGTs as the text to be tested (the token number will be 256 in the token-256 experiments). The default reference data will be the rest of the data. |
| Hardware Specification | Yes | We conduct our experiments using Python 3.9 and Pytorch 2.0 on a server with Intel Core i9 14900K and RTX 4090. |
| Software Dependencies | Yes | We conduct our experiments using Python 3.9 and Pytorch 2.0. |
| Experiment Setup | Yes | In Algorithm 3, we use the Adam optimizer (Kingma & Ba, 2015) to optimize the deep kernel parameters; we set λ to 10⁻⁸, the batch size to 200, and the learning rate to 0.00005 in all experiments. The default significance level for the hypothesis tests, both the two-sample test and the relative test, is α = 0.05 for deciding whether to reject the null hypothesis. |
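
The relative test summarized above decides whether a text sample is closer, in kernel distance, to a human-written reference set or a machine-generated reference set. The following is a minimal sketch of that decision rule using a fixed Gaussian kernel over generic feature vectors; it omits the paper's trained deep kernel and its asymptotic null distribution, and all function names and parameters (`bandwidth`, the toy feature dimensionality) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    # Estimate of squared MMD between samples X and Y
    # (diagonal terms excluded from the within-sample averages).
    kxx = gaussian_kernel(X, X, bandwidth)
    kyy = gaussian_kernel(Y, Y, bandwidth)
    kxy = gaussian_kernel(X, Y, bandwidth)
    n, m = len(X), len(Y)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())

def relative_test(sample, ref_hwt, ref_mgt, bandwidth=1.0):
    # Relative decision rule: label the sample by whichever reference
    # population it is closer to in MMD distance.
    d_hwt = mmd2(sample, ref_hwt, bandwidth)
    d_mgt = mmd2(sample, ref_mgt, bandwidth)
    return "HWT" if d_hwt < d_mgt else "MGT"

# Toy usage with synthetic feature vectors standing in for text embeddings.
rng = np.random.default_rng(0)
ref_hwt = rng.normal(0.0, 1.0, size=(50, 4))   # pretend human-written features
ref_mgt = rng.normal(3.0, 1.0, size=(50, 4))   # pretend machine-generated features
sample = rng.normal(3.0, 1.0, size=(20, 4))    # text to be tested
print(relative_test(sample, ref_hwt, ref_mgt))
```

In the paper the kernel is parameterized and trained (Adam, λ = 10⁻⁸, batch size 200, learning rate 0.00005) and the decision is calibrated at α = 0.05; this sketch replaces all of that with a fixed kernel and a simple nearest-distribution comparison to show the shape of the relative test only.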