Can Large Language Models Understand Intermediate Representations in Compilers?
Authors: Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present an exploratory empirical study evaluating the capabilities of six state-of-the-art LLMs (GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama) in understanding IRs. Specifically, we assess model performance across four core tasks: control flow graph reconstruction, IR decompilation, code summarization, and execution reasoning. |
| Researcher Affiliation | Academia | Kent State University, USA; Huazhong University of Science and Technology, China; Pacific Northwest National Laboratory, USA; Chongqing University, China. Correspondence to: Yao Wan <EMAIL>, Bo Fang <EMAIL>, Qiang Guan <EMAIL>. |
| Pseudocode | No | The paper describes methods and tasks but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the experimental data and source code are publicly available at https://github.com/hjiang13/LLM4IR. |
| Open Datasets | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures. |
| Dataset Splits | No | The paper uses the HumanEval benchmark, which comprises 164 programming tasks. It does not explicitly mention training/test/validation dataset splits, as the focus is on evaluating LLMs on these benchmark tasks rather than training a new model with specific splits. |
| Hardware Specification | Yes | The compilation experiments were conducted on a Dell Workstation equipped with 32 Intel(R) Xeon(R) CPUs E5-2620 v4 @ 2.10GHz, running on an x86-64 architecture with a 64-bit system. |
| Software Dependencies | Yes | For these experiments, we used Clang adapted for LLVM 13 on Ubuntu 18.04. |
| Experiment Setup | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures. To enhance response precision and consistency, we adopt an Expert Meta-Template Prompt format. For each of the four tasks (i.e., CFG reconstruction, IR decompilation, code summarization, and execution reasoning), we iteratively refine prompts using strategies such as few-shot learning and CoT prompting (Wei et al., 2022; Xie et al., 2025). |
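The dataset-construction step described above (compiling each C++ program to LLVM IR at four optimization levels with Clang) can be sketched as follows. This is an illustrative reconstruction, not the authors' released script: the `ir_commands` helper, file layout, and output naming are assumptions; only the `clang++ -S -emit-llvm -O{0..3}` invocation pattern follows from the paper's description.

```python
from pathlib import Path

# The four optimization levels studied in the paper.
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

def ir_commands(src: Path, out_dir: Path) -> list[list[str]]:
    """Build one `clang++ -S -emit-llvm` invocation per optimization level.

    `-S -emit-llvm` makes Clang emit textual LLVM IR (.ll) instead of an
    object file. Returns argument lists suitable for subprocess.run().
    """
    cmds = []
    for opt in OPT_LEVELS:
        out = out_dir / f"{src.stem}{opt}.ll"  # e.g. ir/task042-O2.ll (assumed naming)
        cmds.append(["clang++", "-S", "-emit-llvm", opt, str(src), "-o", str(out)])
    return cmds
```

In practice each command would be executed with `subprocess.run(cmd, check=True)` over the 164 benchmark programs, yielding 656 IR files in total.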