Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting
Authors: Tong Ye, Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji, Wenhai Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate a significant improvement over existing SOTA synthetic content detectors, delivering notable gains in both performance and robustness on the APPS and MBPP benchmarks. |
| Researcher Affiliation | Collaboration | 1) Zhejiang University, 2) Stony Brook University, 3) Anytime.AI |
| Pseudocode | Yes | Algorithm 1: Zero-shot Synthetic Code Detection |
| Open Source Code | No | The paper does not contain an explicit statement that the authors are releasing their code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Due to the lack of existing benchmarks for evaluating synthetic code detectors, we developed two Python-based benchmarks using APPS (Hendrycks et al. 2021) and MBPP (Austin et al. 2021). To assess the generalizability of our method to different programming languages, we construct an additional C++ benchmark using the Code Contest dataset (Li et al. 2022b). For the SimCSE training, we collect thousands of code snippets from publicly available code-related datasets as our training data. |
| Dataset Splits | No | The paper mentions developing benchmarks using APPS and MBPP, and collecting code snippets for training. However, it does not explicitly provide specific details about the training, validation, or test splits (e.g., percentages, sample counts, or explicit split files) used for the experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions various models and tools such as 'GraphCodeBERT', 'SimCSE', 'Code Llama', 'StarChat', 'GPT-3.5-Turbo', and 'GPT-4', but it does not specify version numbers for these software components or for the general programming environment (e.g., Python version, PyTorch version). |
| Experiment Setup | Yes | For code rewriting, we utilize nucleus sampling with a top-p of 0.95 and a temperature of 0.8. Here, τ is a temperature hyperparameter set to 0.1. Our experiments show that using just 4 rewrites is sufficient to achieve excellent detection performance. We conducted experiments on the APPS and MBPP benchmarks, varying the generator temperature over {0.2, 0.4, 0.8} while keeping the rewriting temperature fixed at 0.8. |
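The detection recipe summarized in the table (rewrite the candidate code a few times with an LLM, then score how similar the rewrites are to the original) can be sketched as below. This is a minimal illustration, not the authors' implementation: `rewrite_fn` stands in for an LLM call (the paper uses nucleus sampling with top-p 0.95 and temperature 0.8, with m = 4 rewrites), and the character-bigram `embed` is a toy substitute for the paper's SimCSE-style fine-tuned code encoder.

```python
import math
from collections import Counter

def embed(code: str) -> Counter:
    # Toy featurizer: character-bigram counts. The paper instead embeds
    # code with a SimCSE-style encoder fine-tuned on code snippets; this
    # stand-in only serves to make the scoring pipeline concrete.
    return Counter(code[i:i + 2] for i in range(len(code) - 1))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detection_score(code: str, rewrite_fn, m: int = 4) -> float:
    # Zero-shot score: ask the LLM to rewrite the candidate m times and
    # average the similarity between the original and its rewrites.
    # The intuition is that LLM-generated code survives LLM rewriting
    # with higher similarity than human-written code, so a higher score
    # suggests synthetic origin.
    rewrites = [rewrite_fn(code) for _ in range(m)]
    original = embed(code)
    return sum(cosine(original, embed(r)) for r in rewrites) / m
```

With an identity `rewrite_fn` the score is 1.0; any rewrite that changes surface form (renamed variables, restructured control flow) pulls the score below 1.0. In practice a threshold on this score, chosen on a validation set, separates human-written from LLM-generated code.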