DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
Authors: Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on 10+ LLMs indicate that LLMs are generally good at computation, with an average of 82.44% Pass@1, while falling short on cryptography and system domains, with an average of 33.08% Pass@1 and 37.50% Pass@1, respectively. The performance gap among domains can be as much as 68%+ from Llama-2-13B-Chat model, which is observed to have the largest performance variance compared with other LLMs. |
| Researcher Affiliation | Academia | ¹Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; ²University of Chinese Academy of Sciences, Beijing, China; ³The Hong Kong University of Science and Technology, Hong Kong, China |
| Pseudocode | No | The paper describes a construction pipeline with three steps (Domain Repository Collection, Test-Method Matching & Selection, Instruction Generation) and illustrates it with a diagram (Figure 2), but it does not present these steps in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Code https://github.com/domaineval |
| Open Datasets | Yes | The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements. Code https://github.com/domaineval |
| Dataset Splits | Yes | Each subject consists of three components: instruction for LLM evaluation, reference solution, and a series of test cases, as illustrated in Figure 3. ... For Pass@1 metric, we use greedy decoding, i.e., set temperature to 0.0. For Pass@5 metric, we opt for the minimum sample size N = 5 and maintain temperature at 0.2 and top-p at 0.95. |
| Hardware Specification | No | The paper lists the LLM models evaluated (e.g., GPT-3.5-turbo, Qwen2, Llama2) and mentions using 'torch.bfloat16 when loading LLMs,' but it does not specify the underlying hardware (CPU, GPU models, memory) on which these evaluations were performed. |
| Software Dependencies | No | The paper mentions using 'torch.bfloat16 when loading LLMs' but does not specify the version of PyTorch or any other software dependencies with their version numbers. |
| Experiment Setup | Yes | Our evaluation uses the unbiased version of Pass@k (Chen et al. 2021) to accurately assess the functional correctness of code snippets generated by LLMs. Following prior work (Zhuo et al. 2024), we report Pass@1 and Pass@5 for the experiment in zero-shot setting and use macro-average as scores. For Pass@1 metric, we use greedy decoding, i.e., set temperature to 0.0. For Pass@5 metric, we opt for the minimum sample size N = 5 and maintain temperature at 0.2 and top-p at 0.95. |
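For reference, the unbiased Pass@k estimator cited in the setup (Chen et al. 2021) has the closed form 1 − C(n−c, k)/C(n, k), where n samples are drawn per subject and c pass all tests. The sketch below is illustrative; the `macro_pass_at_k` helper and its interface are assumptions for computing the macro-average the paper reports, not the authors' actual evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k (Chen et al. 2021): 1 - C(n-c, k) / C(n, k).

    n: total samples generated for a subject
    c: number of samples that pass all test cases
    k: the k in Pass@k (requires k <= n)
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def macro_pass_at_k(results, k):
    """Macro-average over subjects; `results` is a list of (n, c) pairs.
    (Hypothetical helper mirroring the paper's macro-averaged reporting.)"""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With the paper's Pass@5 setting (N = 5), a subject with one passing sample yields `pass_at_k(5, 1, 1) = 0.2` and `pass_at_k(5, 1, 5) = 1.0`, illustrating why Pass@5 is reported from multiple stochastic samples while Pass@1 uses greedy decoding.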