BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models

Authors: Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. In summary, BinMetric makes a significant step forward in measuring the binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security.
Researcher Affiliation Collaboration (1) University of Science and Technology of China, Hefei, China; (2) Anhui Province Key Laboratory of Digital Security, Hefei, China; (3) QI-ANXIN Technology Research Institute, Beijing, China
Pseudocode No The paper uses the term 'decompiled pseudo code' to describe input data for several tasks (SR, BCS, AC), but it does not present its own algorithms or methods in structured pseudocode blocks.
Open Source Code No The paper mentions using 20 open-source projects as data sources (e.g., 'we curate 20 high-star C language projects from GitHub') and refers to 'BinMetric' as a benchmark, but it does not provide an explicit statement about releasing the source code for its own methodology or the BinMetric evaluation pipeline.
Open Datasets Yes BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations.
Dataset Splits No The paper describes the composition and number of question items for each task within the BinMetric benchmark (e.g., 'we extract 1,000 question items', 'randomly sample 250 pairs', '70 assembly snippets are sampled', etc.), but it does not provide explicit training/test/validation splits for the evaluation conducted with the LLMs.
Hardware Specification Yes The experiments are conducted on an Ubuntu 22.04 server with 8 NVIDIA RTX A6000 GPUs.
Software Dependencies No The paper mentions 'Open-source LLMs are downloaded from Huggingface' and 'half-precision in FP16 enabled for inference', but it does not specify concrete software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes Given the context window limitations, we set max length to 8192 and max new tokens to 2048. Since accuracy is prioritized over diversity in most code-related tasks, the sampling temperature is set to 0.1, with top-k and top-p both set to 1 for deterministic responses.
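The decoding settings quoted above can be collected into a single configuration, as a minimal sketch. The dict below mirrors the keyword arguments of Hugging Face's `generate()` API; since the paper's own evaluation pipeline is not released, the exact argument names it used are an assumption.

```python
# Hedged sketch of the decoding settings the paper reports:
# max length 8192, max new tokens 2048, temperature 0.1, top-k = top-p = 1.
GENERATION_CONFIG = {
    "max_length": 8192,      # context-window cap
    "max_new_tokens": 2048,  # per-response generation budget
    "do_sample": True,
    "temperature": 0.1,      # low temperature: accuracy prioritized over diversity
    "top_k": 1,              # top-k = 1 makes decoding effectively greedy
    "top_p": 1.0,
}
```

With an open-source model loaded in FP16 (as the paper describes), these settings would typically be passed as `model.generate(**inputs, **GENERATION_CONFIG)`.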