Can Large Language Models Understand Intermediate Representations in Compilers?

Authors: Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present an explorative empirical study evaluating the capabilities of six state-of-the-art LLMs (GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama) in understanding IRs. Specifically, we assess model performance across four core tasks: control flow graph (CFG) reconstruction, IR decompilation, code summarization, and execution reasoning.
Researcher Affiliation | Academia | ¹Kent State University, USA; ²Huazhong University of Science and Technology, China; ³Pacific Northwest National Laboratory, USA; ⁴Chongqing University, China. Correspondence to: Yao Wan <EMAIL>, Bo Fang <EMAIL>, Qiang Guan <EMAIL>.
Pseudocode | No | The paper describes methods and tasks but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All the experimental data and source code are publicly available at https://github.com/hjiang13/LLM4IR.
Open Datasets | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures.
Dataset Splits | No | The paper uses the HumanEval benchmark, which comprises 164 programming tasks. It does not explicitly mention training/validation/test splits, as the focus is on evaluating pre-trained LLMs on these benchmark tasks rather than training a new model with specific splits.
Hardware Specification | Yes | The compilation experiments were conducted on a Dell workstation equipped with 32 Intel(R) Xeon(R) E5-2620 v4 CPUs @ 2.10GHz on a 64-bit x86-64 system.
Software Dependencies | Yes | For these experiments, we used Clang adapted for LLVM 13 on Ubuntu 18.04.
Experiment Setup | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures. To enhance response precision and consistency, we adopt an Expert Meta-Template Prompt format. For each of the four tasks (i.e., CFG reconstruction, IR decompilation, code summarization, and execution reasoning), we iteratively refine prompts using strategies such as few-shot learning and CoT (chain-of-thought) prompting (Wei et al., 2022; Xie et al., 2025).
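The IR-generation step described above (one Clang invocation per optimization level, emitting textual LLVM IR) can be sketched as follows. This is a minimal reconstruction, not the authors' script: the file name `problem_001.cpp` and the output naming scheme are illustrative assumptions; only the `clang++ -S -emit-llvm -O{0..3}` pattern comes from the setup described in the paper.

```python
# Sketch of the assumed IR-generation workflow: build one clang++ command
# per optimization level for a given HumanEval C++ source file.
# -S -emit-llvm makes Clang emit textual LLVM IR (.ll) instead of an object file.
from pathlib import Path

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]  # the four levels used in the paper

def ir_commands(src: str) -> list[list[str]]:
    """Return the clang++ invocations that produce one .ll file per level."""
    stem = Path(src).stem
    return [
        ["clang++", "-S", "-emit-llvm", opt, src, "-o", f"{stem}{opt}.ll"]
        for opt in OPT_LEVELS
    ]

# Print the commands rather than running them, so the sketch has no
# dependency on a local LLVM 13 toolchain.
for cmd in ir_commands("problem_001.cpp"):  # file name is hypothetical
    print(" ".join(cmd))
```

Running these commands with Clang/LLVM 13 would yield four IR variants per program, matching the 164 × 4 dataset layout the review quotes.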