DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Authors: Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive experiments on 10+ LLMs indicate that LLMs are generally good at computation, with an average of 82.44% Pass@1, while falling short on the cryptography and system domains, with averages of 33.08% Pass@1 and 37.50% Pass@1, respectively. The performance gap among domains can be as much as 68%+ for the Llama-2-13B-Chat model, which is observed to have the largest performance variance among the evaluated LLMs.
Researcher Affiliation Academia 1Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China 2University of Chinese Academy of Sciences, Beijing, China 3The Hong Kong University of Science and Technology, Hong Kong, China
Pseudocode No The paper describes a construction pipeline with three steps (Domain Repository Collection, Test-Method Matching & Selection, Instruction Generation) and illustrates it with a diagram (Figure 2), but it does not present these steps in a structured pseudocode or algorithm block format.
Open Source Code Yes Code https://github.com/domaineval
Open Datasets Yes The contributions of this study include a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains, a fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL, providing directions for future research improvements. Code https://github.com/domaineval
Dataset Splits Yes Each subject consists of three components: an instruction for LLM evaluation, a reference solution, and a series of test cases, as illustrated in Figure 3. ... For the Pass@1 metric, we use greedy decoding, i.e., set the temperature to 0.0. For the Pass@5 metric, we opt for the minimum sample size N = 5 and maintain a temperature of 0.2 and top-p of 0.95.
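The three-part subject layout quoted above (instruction, reference solution, test cases) can be sketched as a simple record type. This is an illustrative schema only; the field names and example content are assumptions, not the benchmark's actual data format:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subject:
    """Illustrative record for one benchmark subject: the natural-language
    instruction shown to the LLM, the reference solution, and the test
    cases used to check functional correctness of generated code."""
    instruction: str
    reference_solution: str
    test_cases: List[str] = field(default_factory=list)


# Placeholder example, not an actual DOMAINEVAL subject:
subject = Subject(
    instruction="Implement a function add(a, b) that returns a + b.",
    reference_solution="def add(a, b):\n    return a + b",
    test_cases=["assert add(1, 2) == 3"],
)
```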
Hardware Specification No The paper lists the LLM models evaluated (e.g., GPT-3.5-turbo, Qwen2, Llama2) and mentions using 'torch.bfloat16 when loading LLMs,' but it does not specify the underlying hardware (CPU, GPU models, memory) on which these evaluations were performed.
Software Dependencies No The paper mentions using 'torch.bfloat16 when loading LLMs' but does not specify the version of PyTorch or any other software dependencies with their version numbers.
Experiment Setup Yes Our evaluation uses the unbiased version of Pass@k (Chen et al. 2021) to accurately assess the functional correctness of code snippets generated by LLMs. Following prior work (Zhuo et al. 2024), we report Pass@1 and Pass@5 for the experiments in a zero-shot setting and report macro-averaged scores. For the Pass@1 metric, we use greedy decoding, i.e., set the temperature to 0.0. For the Pass@5 metric, we opt for the minimum sample size N = 5 and maintain a temperature of 0.2 and top-p of 0.95.
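The unbiased Pass@k estimator from Chen et al. (2021) referenced in this row, together with the macro-average over domains, can be sketched as follows. The domain names and (n, c) sample counts below are placeholders for illustration, not results from the paper:

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k (Chen et al. 2021): the probability that at least
    one of k samples, drawn without replacement from n generations of
    which c are correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect generations: a draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Macro-average: each domain contributes equally regardless of its size.
# The (n, c) pairs per domain are illustrative placeholders.
domain_results = {
    "computation": [(5, 4), (5, 5)],
    "cryptography": [(5, 1), (5, 0)],
}
per_domain = {
    domain: mean(pass_at_k(n, c, k=1) for n, c in results)
    for domain, results in domain_results.items()
}
macro_pass_at_1 = mean(per_domain.values())
```

Note that under greedy decoding (the paper's Pass@1 setting) there is a single deterministic sample, so the estimator reduces to a simple pass/fail rate; the unbiased correction matters for Pass@5 with N = 5 samples.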