Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting
Authors: Tong Ye, Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji, Wenhai Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate a significant improvement over existing SOTA synthetic content detectors, delivering notable gains in both performance and robustness on the APPS and MBPP benchmarks. |
| Researcher Affiliation | Collaboration | 1) Zhejiang University, 2) Stony Brook University, 3) Anytime.AI |
| Pseudocode | Yes | Algorithm 1: Zero-shot Synthetic Code Detection |
| Open Source Code | No | The paper does not contain an explicit statement that the authors are releasing their code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | Due to the lack of existing benchmarks for evaluating synthetic code detectors, we developed two Python-based benchmarks using APPS (Hendrycks et al. 2021) and MBPP (Austin et al. 2021). To assess the generalizability of our method to different programming languages, we construct an additional C++ benchmark using the Code Contest dataset (Li et al. 2022b). For the SimCSE training, we collect thousands of code snippets from publicly available code-related datasets as our training data. |
| Dataset Splits | No | The paper mentions developing benchmarks using APPS and MBPP, and collecting code snippets for training. However, it does not explicitly provide specific details about the training, validation, or test splits (e.g., percentages, sample counts, or explicit split files) used for the experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions various models and tools such as 'GraphCodeBERT', 'SimCSE', 'Code Llama', 'StarChat', 'GPT-3.5-Turbo', and 'GPT-4', but it does not specify version numbers for these software components or for the general programming environment (e.g., Python version, PyTorch version). |
| Experiment Setup | Yes | For code rewriting, we utilize nucleus sampling with a top-p of 0.95 and a temperature of 0.8. Here, τ is a temperature hyperparameter set to 0.1. Our experiments show that using just 4 rewrites is sufficient to achieve excellent detection performance. We conducted experiments on the APPS and MBPP benchmarks, varying the generator temperature over {0.2, 0.4, 0.8} while keeping the rewriting temperature fixed at 0.8. |
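The detection recipe summarized in the table (rewrite the candidate code a few times with an LLM, then score how similar the rewrites are to the original) can be sketched as below. This is a minimal illustration, not the authors' implementation: `rewrite_fn` stands in for an LLM call (the paper uses nucleus sampling with top-p 0.95 and temperature 0.8, with m = 4 rewrites), and the character-bigram `embed` is a toy substitute for the paper's SimCSE-style fine-tuned code encoder.

```python
import math
from collections import Counter

def embed(code: str) -> Counter:
    # Toy featurizer: character-bigram counts. The paper instead embeds
    # code with a SimCSE-style encoder fine-tuned on code snippets; this
    # stand-in only serves to make the scoring pipeline concrete.
    return Counter(code[i:i + 2] for i in range(len(code) - 1))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detection_score(code: str, rewrite_fn, m: int = 4) -> float:
    # Zero-shot score: ask the LLM to rewrite the candidate m times and
    # average the similarity between the original and its rewrites.
    # The intuition is that LLM-generated code survives LLM rewriting
    # with higher similarity than human-written code, so a higher score
    # suggests synthetic origin.
    rewrites = [rewrite_fn(code) for _ in range(m)]
    original = embed(code)
    return sum(cosine(original, embed(r)) for r in rewrites) / m
```

With an identity `rewrite_fn` the score is 1.0; any rewrite that changes surface form (renamed variables, restructured control flow) pulls the score below 1.0. In practice a threshold on this score, chosen on a validation set, separates human-written from LLM-generated code.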