How efficient is LLM-generated code? A rigorous & high-standard benchmark
Authors: Ruizhong Qiu, Weiliang Zeng, James Ezick, Christopher Lott, Hanghang Tong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An extensive study across 30 popular LLMs using our benchmark ENAMEL shows that LLMs still fall short of generating expert-level efficient code. Using two subsets of our problem set, we demonstrate that such deficiency is because current LLMs struggle in designing advanced algorithms and are barely aware of implementation optimization. |
| Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; Qualcomm AI Research |
| Pseudocode | Yes | Algorithm 1: Numerically stable eff@k |
| Open Source Code | Yes | Our benchmark is publicly available at https://github.com/q-rz/enamel. |
| Open Datasets | Yes | We carefully select 142 problems out of the 164 problems in HumanEval (Chen et al., 2021) and HumanEval+ (Liu et al., 2023a), excluding trivial problems with Θ(1) time complexity. |
| Dataset Splits | Yes | For each problem i, each level l = 0, 1, . . . , L has Ml test cases. If the output of the code does not match the expected output in any test case or does not pass level 0, we will not count it into the pass@k metric. If the code passes level 0 but exceeds the time limit in some level l ≥ 1, we will still count it into the pass@k metric but will skip the remaining levels (i.e., we assume that it will also exceed the time limit for the remaining levels because the input scale increases with the level l). Finally, we compute its efficiency score according to §2.2. ... We use α = 2, R = 6, h1 = h2 = 3, h3 = 4, M0 = 8, M1 = M2 = M3 = 4. |
| Hardware Specification | Yes | For other open-source models, we use temperature 0.8 and top p 0.95 for sampling on a server with 8 NVIDIA A100 80GB GPUs. ... We run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12). |
| Software Dependencies | No | We run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12). The paper lists only Python 3.10.12; no versioned libraries or solvers are given, which under the criteria is not enough for a 'Yes'. |
| Experiment Setup | Yes | We use α = 2, R = 6, h1 = h2 = 3, h3 = 4, M0 = 8, M1 = M2 = M3 = 4. To minimize server workload fluctuations, we run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12). |
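The leveled timeout rule quoted under "Dataset Splits" can be sketched as a small control-flow helper. This is our illustration of the quoted rule, not code from the benchmark; the `run_level` hook and both function names are hypothetical.

```python
def evaluate_solution(run_level, num_levels):
    """Sketch of the per-problem evaluation flow quoted above.

    `run_level(l)` is a hypothetical hook returning 'ok', 'wrong',
    or 'timeout' for test level l; actual timing and output judging
    are out of scope here. Returns (passes, levels_completed):
    `passes` feeds pass@k, `levels_completed` feeds the efficiency
    score of Section 2.2.
    """
    # Level 0 gates correctness: failing it excludes the sample
    # from the pass@k metric entirely.
    if run_level(0) != 'ok':
        return False, 0
    completed = 1
    for l in range(1, num_levels + 1):
        result = run_level(l)
        if result == 'wrong':
            # A wrong output at any level fails pass@k.
            return False, completed
        if result == 'timeout':
            # Skip the remaining levels: input scale grows with l,
            # so the code is assumed to time out on them as well;
            # the sample still counts toward pass@k.
            break
        completed += 1
    return True, completed
```

With the paper's L = 3 levels, a solution that passes level 0 but times out at level 2 still counts toward pass@k while earning efficiency credit only for the levels it completed.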
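For context on the "Pseudocode" row: eff@k generalizes pass@k to the expected best efficiency score among k sampled completions. Below is a minimal sketch of the standard order-statistic estimator of that expectation over n samples; it is our illustration under that assumption, not the paper's Algorithm 1, and the name `eff_at_k` is ours.

```python
from math import comb

def eff_at_k(scores, k):
    """Order-statistic estimator of E[max of k scores] when k scores
    are drawn uniformly without replacement from the n samples.

    The j-th smallest score (1-indexed) is the maximum of exactly
    C(j-1, k-1) of the C(n, k) possible k-subsets, giving an unbiased
    weighted sum over the sorted scores.
    """
    n = len(scores)
    assert 1 <= k <= n
    e = sorted(scores)  # ascending order statistics e[0] <= ... <= e[n-1]
    total = comb(n, k)
    return sum(comb(j - 1, k - 1) * e[j - 1] for j in range(k, n + 1)) / total
```

Python's `math.comb` is exact integer arithmetic, which sidesteps the overflow and cancellation issues that motivate "numerically stable" formulations; the paper's Algorithm 1 may stabilize the computation differently (e.g., via ratio recurrences or log-space weights). With k = 1 the weights reduce to 1/n (the mean), and with k = n the estimator returns the maximum, matching the pass@k-style limiting cases.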