CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

Authors: Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, Dawn Song

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. The experimental results are presented in Table 2.
Researcher Affiliation Collaboration 1Hong Kong Baptist University 2University of California, Santa Barbara 3Mila – Québec AI Institute 4Université de Montréal 5University of California, Berkeley 6Alibaba Group 7The University of Tokyo 8University of Alberta
Pseudocode Yes Algorithm 1: CodeHalu Algorithm
Open Source Code Yes The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.
Open Datasets Yes The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu. We develop the CodeHaluEval benchmark based on the APPS testing set, following a structured process of Validation–Identification–Construction, as shown in Figure 4.
Dataset Splits Yes We develop the CodeHaluEval benchmark based on the APPS testing set, following a structured process of Validation–Identification–Construction, as shown in Figure 4. CodeHaluEval encompasses eight types of code hallucinations as illustrated in Figure 1, covering 699 distinct tasks and corresponding to 8,883 samples.
Hardware Specification Yes The experimental evaluation is conducted using API calls or 8 NVIDIA A6000 GPUs.
Software Dependencies No No specific software dependencies with version numbers are mentioned, beyond the primary investigation language being Python. According to the TIOBE Index, a metric of programming language popularity, we primarily investigate code hallucinations within Python.
Experiment Setup Yes We integrate resource (time and memory) constraints into the code generation instructions. When selecting the threshold k, we consider both the minimum number of samples required to detect code hallucination effects in the Code Halu Eval benchmark and the inference costs associated with evaluating various LLMs.
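The execution-based verification with resource constraints described above can be sketched as follows. This is a minimal illustrative harness, not the authors' actual sandbox: the function names (`run_with_limits`, `verify`), the default time/memory limits, and the stdin/stdout test-case format are assumptions, and the memory cap relies on POSIX `resource` limits, so it is Linux-specific.

```python
import resource
import subprocess
import sys


def run_with_limits(code: str, stdin_data: str, time_limit: float = 2.0,
                    mem_limit_mb: int = 256) -> dict:
    """Run candidate code in a subprocess under time and memory limits.

    Illustrative sketch of execution-based verification: generated code
    is executed against test inputs, and timeouts or crashes are treated
    as failures rather than silently ignored.
    """
    def set_mem_limit():
        # Cap the child's address space so runaway allocations fail fast
        # instead of exhausting host memory (POSIX-only; hypothetical limit).
        limit = mem_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_data, capture_output=True, text=True,
            timeout=time_limit, preexec_fn=set_mem_limit,
        )
        status = "ok" if proc.returncode == 0 else "runtime_error"
        return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}


def verify(code: str, test_cases: list[tuple[str, str]]) -> bool:
    """Pass only if every (stdin, expected_stdout) pair matches exactly."""
    for stdin_data, expected in test_cases:
        result = run_with_limits(code, stdin_data)
        if result["status"] != "ok" or result["stdout"].strip() != expected.strip():
            return False
    return True
```

A harness of this shape makes the hallucination categories observable: a wrong answer, a runtime error, and a timeout are distinguishable outcomes rather than a single pass/fail bit.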