LPDetective: Dusting the LLM Chats for Prompt Template Abusers

Authors: Yang Luo, Qingni Shen, Zhonghai Wu

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct systematic experiments on three large-scale real-world datasets: Bing Copilot, WildChat, and ChatLog. The results show that LPDetective can efficiently and accurately detect robot prompt templates in various scenarios, achieving a 7.5% improvement in F1 score over the state-of-the-art XLNet method and a 178x reduction in detection latency on the Bing Copilot dataset.
Researcher Affiliation | Academia | Yang Luo (1,2,3), Qingni Shen (1,2,3), and Zhonghai Wu (1,2,3). 1: National Engineering Research Center for Software Engineering, Peking University, Beijing, China; 2: School of Software and Microelectronics, Peking University, Beijing, China; 3: PKU-OCTA Laboratory for Blockchain and Privacy Computing, Peking University, Beijing, China. EMAIL
Pseudocode | Yes | Algorithm 1 shows the complete process of regular expression extraction, where n is the size of cluster C, l is the average length of the strings, and k is the number of regular-expression clusters. ... Algorithm 2 shows the optimized matching process.
Open Source Code | No | The paper neither states that the code for the described methodology is released nor links to a code repository. It mentions only that "The experimental code was implemented based on PyTorch 2.2.0.", which describes the implementation basis, not public availability.
Open Datasets | Yes | We evaluate the performance of LPDetective on three datasets: Bing Copilot, WildChat [Zhao et al., 2024], and ChatLog [Tu et al., 2023].
Dataset Splits | Yes | We randomly divided each website's dataset into a training set (70%), a validation set (10%), and a test set (20%).
Hardware Specification | Yes | All experiments were conducted on an Ubuntu 20.04 server equipped with an Intel Xeon 8369B CPU, 96 GB of memory, and an NVIDIA V100 GPU.
Software Dependencies | Yes | The experimental code was implemented based on PyTorch 2.2.0.
Experiment Setup | Yes | All models used the Adam optimizer. We searched for the initial learning rate between 0.0001 and 0.1, the batch size between 16 and 128, and the number of training iterations between 10 and 1000, selecting the hyperparameter combination with the highest F1 score on the validation set as the final setting. For all models, the final initial learning rate was 0.001, the batch size was 64, and the number of training iterations was 100.
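The 70%/10%/20% split reported in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name `split_dataset` and the fixed seed are assumptions.

```python
import random

def split_dataset(records, seed=42):
    """Randomly split one website's records into train/val/test.

    Ratios follow the paper's reported split: 70% training,
    10% validation, 20% test. The seed is illustrative only.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# For 1000 records this yields 700 / 100 / 200 examples.
train, val, test = split_dataset(range(1000))
```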
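The hyperparameter search in the Experiment Setup row can be sketched as a grid search that picks the combination with the highest validation F1. The paper gives only the search ranges, so the specific grid points below and the `eval_fn`/`toy` evaluation helper are assumptions for illustration; in the paper each candidate would be a full Adam training run.

```python
import itertools

# Search ranges from the paper; the discrete grid points are assumed.
LEARNING_RATES = [1e-4, 1e-3, 1e-2, 1e-1]   # searched between 0.0001 and 0.1
BATCH_SIZES = [16, 32, 64, 128]             # searched between 16 and 128
NUM_ITERS = [10, 100, 1000]                 # searched between 10 and 1000

def grid_search(eval_fn):
    """Return the (lr, batch_size, iterations) combination maximizing
    validation F1, where eval_fn(lr, bs, iters) -> F1 score."""
    best_cfg, best_f1 = None, -1.0
    for lr, bs, it in itertools.product(LEARNING_RATES, BATCH_SIZES, NUM_ITERS):
        f1 = eval_fn(lr, bs, it)
        if f1 > best_f1:
            best_cfg, best_f1 = (lr, bs, it), f1
    return best_cfg, best_f1

# Toy stand-in for a training run, peaking at the paper's final
# setting (lr=0.001, batch size 64, 100 iterations).
toy = lambda lr, bs, it: 1.0 - abs(lr - 1e-3) - abs(bs - 64) / 1000 - abs(it - 100) / 10000
cfg, f1 = grid_search(toy)
# cfg -> (0.001, 64, 100)
```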