LPDetective: Dusting the LLM Chats for Prompt Template Abusers

Authors: Yang Luo, Qingni Shen, Zhonghai Wu

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct systematic experiments on three large-scale real-world datasets: Bing Copilot, WildChat, and ChatLog. The results show that LPDetective can efficiently and accurately detect robot prompt templates in various scenarios, achieving a 7.5% improvement in F1 score over the state-of-the-art XLNet method and a 178x reduction in detection latency on the Bing Copilot dataset.
Researcher Affiliation | Academia | Yang Luo (1,2,3), Qingni Shen (1,2,3), and Zhonghai Wu (1,2,3). 1: National Engineering Research Center for Software Engineering, Peking University, Beijing, China; 2: School of Software and Microelectronics, Peking University, Beijing, China; 3: PKU-OCTA Laboratory for Blockchain and Privacy Computing, Peking University, Beijing, China. EMAIL
Pseudocode | Yes | Algorithm 1 shows the complete process of regular expression extraction, where n is the size of cluster C, l is the average length of the strings, and k is the number of regular-expression clusters. ... Algorithm 2 shows the optimized matching process.
Open Source Code | No | The paper neither states that the code for the described methodology is released nor links to a code repository. It mentions only that "The experimental code was implemented based on PyTorch 2.2.0.", which describes the implementation basis, not public availability.
Open Datasets | Yes | We evaluate the performance of LPDetective on three datasets: Bing Copilot, WildChat [Zhao et al., 2024], and ChatLog [Tu et al., 2023].
Dataset Splits | Yes | We randomly divided each website's dataset into a training set (70%), a validation set (10%), and a test set (20%).
Hardware Specification | Yes | All experiments were conducted on an Ubuntu 20.04 server equipped with an Intel Xeon 8369B CPU, 96 GB of memory, and an NVIDIA V100 GPU.
Software Dependencies | Yes | The experimental code was implemented based on PyTorch 2.2.0.
Experiment Setup | Yes | All models used the Adam optimizer. We searched for the initial learning rate between 0.0001 and 0.1, the batch size between 16 and 128, and the number of training iterations between 10 and 1000, selecting the hyperparameter combination with the highest F1 score on the validation set as the final setting. For all models, the final initial learning rate was 0.001, the batch size was 64, and the number of training iterations was 100.
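The 70%/10%/20% split reported in the Dataset Splits row can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name `split_dataset` and the fixed seed are assumptions.

```python
import random

def split_dataset(records, seed=42):
    """Randomly split one website's records into train/val/test.

    Ratios follow the paper's reported split: 70% training,
    10% validation, 20% test. The seed is illustrative only.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# For 1000 records this yields 700 / 100 / 200 examples.
train, val, test = split_dataset(range(1000))
```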
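The hyperparameter search in the Experiment Setup row can be sketched as a grid search that picks the combination with the highest validation F1. The paper gives only the search ranges, so the specific grid points below and the `eval_fn`/`toy` evaluation helper are assumptions for illustration; in the paper each candidate would be a full Adam training run.

```python
import itertools

# Search ranges from the paper; the discrete grid points are assumed.
LEARNING_RATES = [1e-4, 1e-3, 1e-2, 1e-1]   # searched between 0.0001 and 0.1
BATCH_SIZES = [16, 32, 64, 128]             # searched between 16 and 128
NUM_ITERS = [10, 100, 1000]                 # searched between 10 and 1000

def grid_search(eval_fn):
    """Return the (lr, batch_size, iterations) combination maximizing
    validation F1, where eval_fn(lr, bs, iters) -> F1 score."""
    best_cfg, best_f1 = None, -1.0
    for lr, bs, it in itertools.product(LEARNING_RATES, BATCH_SIZES, NUM_ITERS):
        f1 = eval_fn(lr, bs, it)
        if f1 > best_f1:
            best_cfg, best_f1 = (lr, bs, it), f1
    return best_cfg, best_f1

# Toy stand-in for a training run, peaking at the paper's final
# setting (lr=0.001, batch size 64, 100 iterations).
toy = lambda lr, bs, it: 1.0 - abs(lr - 1e-3) - abs(bs - 64) / 1000 - abs(it - 100) / 10000
cfg, f1 = grid_search(toy)
# cfg -> (0.001, 64, 100)
```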