CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

Authors: Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, Dawn Song

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. The experimental results are presented in Table 2.
Researcher Affiliation Collaboration 1Hong Kong Baptist University 2University of California, Santa Barbara 3Mila – Québec AI Institute 4Université de Montréal 5University of California, Berkeley 6Alibaba Group 7The University of Tokyo 8University of Alberta
Pseudocode Yes Algorithm 1: CodeHalu Algorithm
Open Source Code Yes The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.
Open Datasets Yes The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu. We develop the CodeHaluEval benchmark based on the APPS testing set, following a structured process of Validation–Identification–Construction, as shown in Figure 4.
Dataset Splits Yes We develop the CodeHaluEval benchmark based on the APPS testing set, following a structured process of Validation–Identification–Construction, as shown in Figure 4. CodeHaluEval encompasses eight types of code hallucinations as illustrated in Figure 1, covering 699 distinct tasks and corresponding to 8,883 samples.
Hardware Specification Yes The experimental evaluation is conducted using API calls or 8 NVIDIA A6000 GPUs.
Software Dependencies No No specific software dependencies with version numbers are mentioned, beyond the primary investigation language being Python. According to the TIOBE Index, a metric of programming language popularity, we primarily investigate code hallucinations within Python.
Experiment Setup Yes We integrate resource (time and memory) constraints into the code generation instructions. When selecting the threshold k, we consider both the minimum number of samples required to detect code hallucination effects in the Code Halu Eval benchmark and the inference costs associated with evaluating various LLMs.
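The execution-based verification with resource constraints described above can be sketched as follows. This is a minimal illustrative harness, not the authors' actual sandbox: the function names (`run_with_limits`, `verify`), the default time/memory limits, and the stdin/stdout test-case format are assumptions, and the memory cap relies on POSIX `resource` limits, so it is Linux-specific.

```python
import resource
import subprocess
import sys


def run_with_limits(code: str, stdin_data: str, time_limit: float = 2.0,
                    mem_limit_mb: int = 256) -> dict:
    """Run candidate code in a subprocess under time and memory limits.

    Illustrative sketch of execution-based verification: generated code
    is executed against test inputs, and timeouts or crashes are treated
    as failures rather than silently ignored.
    """
    def set_mem_limit():
        # Cap the child's address space so runaway allocations fail fast
        # instead of exhausting host memory (POSIX-only; hypothetical limit).
        limit = mem_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_data, capture_output=True, text=True,
            timeout=time_limit, preexec_fn=set_mem_limit,
        )
        status = "ok" if proc.returncode == 0 else "runtime_error"
        return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}


def verify(code: str, test_cases: list[tuple[str, str]]) -> bool:
    """Pass only if every (stdin, expected_stdout) pair matches exactly."""
    for stdin_data, expected in test_cases:
        result = run_with_limits(code, stdin_data)
        if result["status"] != "ok" or result["stdout"].strip() != expected.strip():
            return False
    return True
```

A harness of this shape makes the hallucination categories observable: a wrong answer, a runtime error, and a timeout are distinguishable outcomes rather than a single pass/fail bit.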