Can Large Language Models Understand Intermediate Representations in Compilers?

Authors: Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present an explorative empirical study evaluating the capabilities of six state-of-the-art LLMs (GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama) in understanding IRs. Specifically, we assess model performance across four core tasks: control flow graph (CFG) reconstruction, IR decompilation, code summarization, and execution reasoning.
Researcher Affiliation | Academia | ¹Kent State University, USA; ²Huazhong University of Science and Technology, China; ³Pacific Northwest National Laboratory, USA; ⁴Chongqing University, China. Correspondence to: Yao Wan <EMAIL>, Bo Fang <EMAIL>, Qiang Guan <EMAIL>.
Pseudocode | No | The paper describes methods and tasks but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All the experimental data and source code are publicly available at https://github.com/hjiang13/LLM4IR.
Open Datasets | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures.
Dataset Splits | No | The paper uses the HumanEval benchmark, which comprises 164 programming tasks. It does not explicitly mention training/validation/test splits, as the focus is on evaluating pre-trained LLMs on these benchmark tasks rather than training a new model with specific splits.
Hardware Specification | Yes | The compilation experiments were conducted on a Dell workstation equipped with 32 Intel(R) Xeon(R) E5-2620 v4 CPUs @ 2.10GHz on a 64-bit x86-64 system.
Software Dependencies | Yes | For these experiments, we used Clang adapted for LLVM 13 on Ubuntu 18.04.
Experiment Setup | Yes | All evaluations are conducted on a benchmark dataset derived from HumanEval (Zheng et al., 2023), consisting of 164 C++ programs paired with their corresponding LLVM IRs. Each program is compiled using Clang at four optimization levels (-O0, -O1, -O2, and -O3) to generate a diverse set of LLVM IRs that capture both unoptimized and progressively optimized code structures. To enhance response precision and consistency, we adopt an Expert Meta-Template Prompt format. For each of the four tasks (i.e., CFG reconstruction, IR decompilation, code summarization, and execution reasoning), we iteratively refine prompts using strategies such as few-shot learning and CoT (chain-of-thought) prompting (Wei et al., 2022; Xie et al., 2025).
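The IR-generation step described above (one Clang invocation per optimization level, emitting textual LLVM IR) can be sketched as follows. This is a minimal reconstruction, not the authors' script: the file name `problem_001.cpp` and the output naming scheme are illustrative assumptions; only the `clang++ -S -emit-llvm -O{0..3}` pattern comes from the setup described in the paper.

```python
# Sketch of the assumed IR-generation workflow: build one clang++ command
# per optimization level for a given HumanEval C++ source file.
# -S -emit-llvm makes Clang emit textual LLVM IR (.ll) instead of an object file.
from pathlib import Path

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]  # the four levels used in the paper

def ir_commands(src: str) -> list[list[str]]:
    """Return the clang++ invocations that produce one .ll file per level."""
    stem = Path(src).stem
    return [
        ["clang++", "-S", "-emit-llvm", opt, src, "-o", f"{stem}{opt}.ll"]
        for opt in OPT_LEVELS
    ]

# Print the commands rather than running them, so the sketch has no
# dependency on a local LLVM 13 toolchain.
for cmd in ir_commands("problem_001.cpp"):  # file name is hypothetical
    print(" ".join(cmd))
```

Running these commands with Clang/LLVM 13 would yield four IR variants per program, matching the 164 × 4 dataset layout the review quotes.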