Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors

Authors: Tianchun Wang, Yuanzhou Chen, Zichuan Liu, Zhanwen Chen, Haifeng Chen, Xiang Zhang, Wei Cheng

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct systematic evaluations on extensive datasets using proxy-attacked open-source models, including Llama2-13B, Llama3-70B, and Mixtral-8x7B, in both white- and black-box settings. Our findings show that the proxy-attack strategy effectively deceives the leading detectors, resulting in an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 95.0% on a single dataset. Furthermore, in cross-discipline scenarios, our strategy also bypasses these detectors, leading to a significant relative decrease of up to 90.9%, while in the cross-language scenario the drop reaches 91.3%.
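The reported percentages read as relative AUROC decreases. A minimal sketch of that metric, assuming the drop is computed as a percentage of the pre-attack AUROC (the exact formula is not quoted here):

```python
def relative_auroc_drop(auroc_before: float, auroc_after: float) -> float:
    """Relative AUROC decrease, expressed as a percentage of the
    pre-attack score (assumed formula, for illustration only)."""
    return 100.0 * (auroc_before - auroc_after) / auroc_before
```

For example, a detector falling from 0.95 AUROC to 0.0475 corresponds to a 95.0% relative drop, matching the maximum single-dataset figure reported above.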
Researcher Affiliation Collaboration The Pennsylvania State University; University of California, Los Angeles; Nanjing University; University of Virginia; NEC Laboratories America
Pseudocode No The paper describes methods through textual explanations and mathematical formulations (e.g., equations 1-5) and figures (Figure 1 and 2), but does not contain a dedicated pseudocode or algorithm block.
Open Source Code Yes Code is available at https://github.com/xztcwang/evad_detection
Open Datasets Yes We evaluate the detection evasion capability of HUMPA attacked by humanized SLMs on the same dataset, including OpenWebText (Gokaslan & Cohen, 2019), WritingPrompts (Fan et al., 2018) and PubMedQA (Jin et al., 2019). For cross-domain evaluation, detection evasion is performed using humanized SLMs on a different dataset. The experiments are conducted on the cross-discipline corpus GPABench2 (Liu et al., 2023), a dataset containing titles and abstracts of scientific writing across Computer Science (CS), Physics (PHX), and Humanities and Social Sciences (HSS). The cross-language evaluation is conducted on WMT-2016 (Bojar et al., 2016)...
Dataset Splits Yes We divided the dataset into evaluation and training sets. For the OpenWebText, WritingPrompts, PubMedQA, CS, and PHX datasets, we randomly choose 10k training samples each, following the setup in (Nicks et al., 2024). For the WMT-2016 dataset, all English samples were used as the training set. We randomly sample 500 examples of each dataset as human-written texts. For cross-discipline evaluation, we randomly select 200 human-written texts for each dataset. For cross-language evaluation, we sample 150 human-written texts.
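The described split procedure (10k random training samples, a disjoint random evaluation pool) can be sketched as follows; the seeding and exact sampling order are assumptions, not details from the paper:

```python
import random

def make_splits(corpus, n_train=10_000, n_eval=500, seed=0):
    """Draw a random training set and a disjoint evaluation set.

    Illustrative only: the paper reports the split sizes but not the
    sampling implementation or seed.
    """
    rng = random.Random(seed)
    idx = list(range(len(corpus)))
    rng.shuffle(idx)
    train = [corpus[i] for i in idx[:n_train]]
    eval_ = [corpus[i] for i in idx[n_train:n_train + n_eval]]
    return train, eval_
```

The evaluation pool sizes then vary per setting: 500 human-written texts for the main datasets, 200 for cross-discipline, and 150 for cross-language.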
Hardware Specification Yes We conduct the experiments on a server with 4 NVIDIA A100 GPUs, each with 80GB of memory.
Software Dependencies No The paper mentions software components like Low-Rank Adaptation (LoRA) and AdamW, but does not provide specific version numbers for these or any other key software dependencies.
Experiment Setup Yes During the DPO fine-tuning phase, we apply Low-Rank Adaptation (LoRA) to fine-tune the SLM, setting the batch size to 8, the learning rate to 2e-4, and the optimizer to AdamW. We choose the attack ratio α from the set {0.1, 0.2, ..., 1.0} to balance generation utility against detection evasion performance. We use a temperature of 1.0 across all the experiments, the same setting as in (Nicks et al., 2024). To compare the fine-tuning time, we set the DPO batch size to 8 and the number of epochs to 5.
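The reported hyperparameters can be collected into a single configuration sketch. Note that LoRA rank and scaling are not reported in the quoted setup, so they are deliberately omitted here; only values stated above are included:

```python
# Hyperparameters as reported in the paper's experiment setup.
# The attack ratio alpha is swept over {0.1, 0.2, ..., 1.0}.
dpo_finetune_config = {
    "method": "DPO",
    "adapter": "LoRA",          # rank/alpha not reported in the paper
    "batch_size": 8,
    "learning_rate": 2e-4,
    "optimizer": "AdamW",
    "epochs": 5,                # used for the fine-tuning-time comparison
    "temperature": 1.0,         # same setting as (Nicks et al., 2024)
    "attack_ratio_grid": [round(0.1 * k, 1) for k in range(1, 11)],
}
```

Such a config would typically be fed to a PEFT/TRL-style fine-tuning script, but the paper does not name a specific framework.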