How efficient is LLM-generated code? A rigorous & high-standard benchmark

Authors: Ruizhong Qiu, Weiliang Zeng, James Ezick, Christopher Lott, Hanghang Tong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An extensive study across 30 popular LLMs using our benchmark ENAMEL shows that LLMs still fall short of generating expert-level efficient code. Using two subsets of our problem set, we demonstrate that such deficiency is because current LLMs struggle in designing advanced algorithms and are barely aware of implementation optimization.
Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; Qualcomm AI Research
Pseudocode | Yes | Algorithm 1: Numerically stable $\widehat{\mathrm{eff}}_i@k$
Open Source Code | Yes | Our benchmark is publicly available at https://github.com/q-rz/enamel.
Open Datasets | Yes | We carefully select 142 problems out of the 164 problems in HumanEval (Chen et al., 2021) and HumanEval+ (Liu et al., 2023a), excluding trivial problems with Θ(1) time complexity.
Dataset Splits | Yes | For each problem i, each level l = 0, 1, ..., L has M_l test cases. If the output of the code does not match the expected output in any test case or does not pass level 0, we will not count it into the pass@k metric. If the code passes level 0 but exceeds the time limit in some level l ≥ 1, we will still count it into the pass@k metric but will skip the remaining levels (i.e., we assume that it will also exceed the time limit for the remaining levels because the input scale increases with the level l). Finally, we compute its efficiency score according to §2.2. ... We use α = 2, R = 6, h1 = h2 = 3, h3 = 4, M0 = 8, M1 = M2 = M3 = 4.
Hardware Specification | Yes | For other open-source models, we use temperature 0.8 and top_p 0.95 for sampling on a server with 8 NVIDIA A100 80GB GPUs. ... We run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12).
Software Dependencies | No | We run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12). The paper lists only Python 3.10.12, with no versioned libraries or solvers, which does not meet the criteria for a 'Yes'.
Experiment Setup | Yes | We use α = 2, R = 6, h1 = h2 = 3, h3 = 4, M0 = 8, M1 = M2 = M3 = 4. To minimize server workload fluctuations, we run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12).
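
The level-by-level rules quoted under Dataset Splits can be sketched in Python as follows. This is a minimal illustration of the counting and skipping logic only, not the benchmark's actual implementation: the names `evaluate_sample`, `EvalResult`, and the `run_case` callback are hypothetical, and the efficiency score of §2.2 is deliberately left out because its definition is not reproduced here. Only the per-level test counts (M0 = 8, M1 = M2 = M3 = 4) come from the reported setup.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Test cases per level, from the reported setup: M0 = 8, M1 = M2 = M3 = 4.
TESTS_PER_LEVEL = (8, 4, 4, 4)


@dataclass
class EvalResult:
    counts_for_pass_at_k: bool  # whether the sample is counted in pass@k
    level_status: List[str]     # "passed", "failed", "timeout", or "skipped" per level


def evaluate_sample(
    run_case: Callable[[int, int], str],
    tests_per_level: Sequence[int] = TESTS_PER_LEVEL,
) -> EvalResult:
    """Apply the level rules to one generated solution.

    run_case(level, index) is a hypothetical callback that returns
    "ok", "wrong", or "timeout" for a single test case.
    """
    status: List[str] = []
    n_levels = len(tests_per_level)
    for level, n_cases in enumerate(tests_per_level):
        for index in range(n_cases):
            verdict = run_case(level, index)
            if verdict == "ok":
                continue
            remaining = ["skipped"] * (n_levels - level - 1)
            if verdict == "wrong" or level == 0:
                # Wrong output anywhere, or any failure at level 0:
                # the sample is not counted in pass@k.
                return EvalResult(False, status + ["failed"] + remaining)
            # Time limit exceeded at some level >= 1: still counted in
            # pass@k, but the remaining (larger) levels are skipped.
            return EvalResult(True, status + ["timeout"] + remaining)
        status.append("passed")
    return EvalResult(True, status)


# Example: a solution that is correct but too slow from level 2 onward.
result = evaluate_sample(lambda level, index: "timeout" if level >= 2 else "ok")
```

In this sketch a timeout ends evaluation at the first failing case, which mirrors the stated assumption that larger inputs at later levels would also exceed the time limit.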