POLO: An LLM-Powered Project-Level Code Performance Optimization Framework

Authors: Jiameng Bai, Ruoyi Xu, Sai Wu, Dingyu Yang, Junbo Zhao, Gang Chen

IJCAI 2025

Reproducibility assessment: Variable | Result | Supporting evidence from the paper
Research Type: Experimental. Evidence: "We conduct experiments on open-source and proprietary projects. The results demonstrate that POLO accurately identifies performance bottlenecks and successfully applies optimizations. Under the O3 compilation flag, the optimized programs achieved speedups ranging from 1.34x to 21.5x."
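For reference, the reported speedup figures follow the standard ratio of baseline runtime to optimized runtime measured under the same compilation flag. A minimal sketch (the timings below are hypothetical, for illustration only, not the paper's measurements):

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup = baseline runtime / optimized runtime, both measured
    under the same compiler flag (e.g. -O3)."""
    return baseline_s / optimized_s

# Hypothetical timings: a program that took 10.75s before optimization
# and 0.5s after corresponds to a 21.5x speedup.
print(round(speedup(10.75, 0.5), 2))  # 21.5
```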
Researcher Affiliation: Academia. Evidence: "1. College of Computer Science and Technology, Zhejiang University; 2. Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; 3. The State Key Laboratory of Blockchain and Data Security, Zhejiang University"
Pseudocode: Yes. Evidence: "The analysis process is detailed in Algorithm 1 in Appendix A."
Open Source Code: No. The paper provides no explicit statement about releasing POLO's source code and no link to a repository.
Open Datasets: Yes. Evidence (benchmark projects as listed in the paper):
- Quant [Smidt, 2012] | C++ | 5, 1, 9, 315 | Quantitative Trade Framework
- AStar (Private) | C++ | 10, 9, 18, 484 | A* Search Algorithm
- (Medium) Skip List (Private) | C++ | 6, 7, 44, 886 | Skip List Implementation
- AES [Conte, 2015] | C | 3, 0, 31, 1409 | Crypto Algorithm
- (Hard) KGraph [DBAIWang Group, 2021] | C++ | 8, 22, 109, 3937 | Nearest Neighbor Search
- Mini SQL (Private) | C++ | 129, 130, 772, 15915 | Small Database System
Additional dataset mentions: "KGraph: Audio [Group, 2006], Sift1M [Jegou et al., 2010]; Mini SQL: provided, 83.73%; AES [Bell et al., 1990]: 93.69%"
Dataset Splits: No. No explicit train/test/validation splits are provided; the paper optimizes existing C/C++ projects and evaluates their execution time rather than training a model on a dataset with defined splits.
Hardware Specification: No. The paper does not state the hardware used for the experiments, such as CPU/GPU models, memory, or the computing environment.
Software Dependencies: No. Evidence: "We use the Callgrind profiling tool [Weidendorfer, 2012] for dynamic analysis. ... We use Clang's LibTooling [Team, 2007] to construct an Abstract Syntax Tree (AST) for each source file. ... To balance effectiveness and cost, we use GPT-4o [OpenAI, 2024] as the default LLM agent." These mentions lack version numbers for Callgrind and Clang's LibTooling, and GPT-4o is an LLM model rather than a versioned software library or tool in the traditional sense.
Experiment Setup: Yes. Evidence: "To balance effectiveness and cost, we use GPT-4o [OpenAI, 2024] as the default LLM agent. We set the temperature to 0.2 to ensure the results are as deterministic and reproducible as possible. ... For each code optimization, we adopt a Top-N selection criterion, i.e., generating N results and selecting the best one among them. In our experiments, N=5. ... For all projects, we conduct tests using both O0 and O3 flags. ... Each project is executed five times, and the average execution time is reported."
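The setup above (N=5 candidates per optimization, best-of-N chosen by measured runtime, each measurement averaged over five executions) can be sketched as follows. This is a minimal illustration, not the paper's implementation; `generate_candidate` and `run_once` are hypothetical stand-ins for the LLM call and the benchmarked program.

```python
import statistics
from typing import Callable, List, Tuple

def avg_runtime(run_once: Callable[[], float], repeats: int = 5) -> float:
    # Execute the benchmarked program `repeats` times (5 in the paper)
    # and report the mean measured time.
    return statistics.mean(run_once() for _ in range(repeats))

def top_n_select(
    generate_candidate: Callable[[], str],
    benchmark: Callable[[str], float],
    n: int = 5,
) -> Tuple[str, float]:
    # Top-N selection: generate N optimized variants (N=5 in the paper)
    # and keep the one with the lowest benchmarked runtime.
    candidates: List[str] = [generate_candidate() for _ in range(n)]
    timed = [(c, benchmark(c)) for c in candidates]
    return min(timed, key=lambda pair: pair[1])
```

With deterministic stubs in place of the LLM and the real benchmark, `top_n_select` simply returns the fastest of the five generated variants.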