DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Authors: Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive experiments on 10+ LLMs indicate that LLMs are generally good at computation, with an average of 82.44% Pass@1, while falling short on the cryptography and system domains, with averages of 33.08% Pass@1 and 37.50% Pass@1, respectively. The performance gap among domains can be as much as 68%+ for the Llama-2-13B-Chat model, which is observed to have the largest performance variance among the evaluated LLMs.
Researcher Affiliation Academia 1Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China 2University of Chinese Academy of Sciences, Beijing, China 3The Hong Kong University of Science and Technology, Hong Kong, China
Pseudocode No The paper describes a construction pipeline with three steps (Domain Repository Collection, Test-Method Matching & Selection, Instruction Generation) and illustrates it with a diagram (Figure 2), but it does not present these steps in a structured pseudocode or algorithm block format.
Open Source Code Yes Code https://github.com/domaineval
Open Datasets Yes The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements. Code https://github.com/domaineval
Dataset Splits Yes Each subject consists of three components: an instruction for LLM evaluation, a reference solution, and a series of test cases, as illustrated in Figure 3. ... For the Pass@1 metric, we use greedy decoding, i.e., set the temperature to 0.0. For the Pass@5 metric, we opt for the minimum sample size N = 5 and maintain a temperature of 0.2 and top-p of 0.95.
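The three-part subject layout quoted above (instruction, reference solution, test cases) can be sketched as a simple record type. This is an illustrative schema only; the field names and example content are assumptions, not the benchmark's actual data format:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subject:
    """Illustrative record for one benchmark subject: the natural-language
    instruction shown to the LLM, the reference solution, and the test
    cases used to check functional correctness of generated code."""
    instruction: str
    reference_solution: str
    test_cases: List[str] = field(default_factory=list)


# Placeholder example, not an actual DOMAINEVAL subject:
subject = Subject(
    instruction="Implement a function add(a, b) that returns a + b.",
    reference_solution="def add(a, b):\n    return a + b",
    test_cases=["assert add(1, 2) == 3"],
)
```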
Hardware Specification No The paper lists the LLM models evaluated (e.g., GPT-3.5-turbo, Qwen2, Llama2) and mentions using 'torch.bfloat16 when loading LLMs,' but it does not specify the underlying hardware (CPU, GPU models, memory) on which these evaluations were performed.
Software Dependencies No The paper mentions using 'torch.bfloat16 when loading LLMs' but does not specify the version of PyTorch or any other software dependencies with their version numbers.
Experiment Setup Yes Our evaluation uses the unbiased version of Pass@k (Chen et al. 2021) to accurately assess the functional correctness of code snippets generated by LLMs. Following prior work (Zhuo et al. 2024), we report Pass@1 and Pass@5 for the experiments in a zero-shot setting and report macro-averaged scores. For the Pass@1 metric, we use greedy decoding, i.e., set the temperature to 0.0. For the Pass@5 metric, we opt for the minimum sample size N = 5 and maintain a temperature of 0.2 and top-p of 0.95.
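The unbiased Pass@k estimator from Chen et al. (2021) referenced in this row, together with the macro-average over domains, can be sketched as follows. The domain names and (n, c) sample counts below are placeholders for illustration, not results from the paper:

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k (Chen et al. 2021): the probability that at least
    one of k samples, drawn without replacement from n generations of
    which c are correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect generations: a draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Macro-average: each domain contributes equally regardless of its size.
# The (n, c) pairs per domain are illustrative placeholders.
domain_results = {
    "computation": [(5, 4), (5, 5)],
    "cryptography": [(5, 1), (5, 0)],
}
per_domain = {
    domain: mean(pass_at_k(n, c, k=1) for n, c in results)
    for domain, results in domain_results.items()
}
macro_pass_at_1 = mean(per_domain.values())
```

Note that under greedy decoding (the paper's Pass@1 setting) there is a single deterministic sample, so the estimator reduces to a simple pass/fail rate; the unbiased correction matters for Pass@5 with N = 5 samples.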