BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models
Authors: Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. In summary, BinMetric makes a significant step forward in measuring the binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security. |
| Researcher Affiliation | Collaboration | 1University of Science and Technology of China, Hefei, China 2Anhui Province Key Laboratory of Digital Security, Hefei, China 3QI-ANXIN Technology Research Institute, Beijing, China |
| Pseudocode | No | The paper uses the term 'decompiled pseudo code' to describe input data for several tasks (SR, BCS, AC), but it does not present its own algorithms or methods in structured pseudocode blocks. |
| Open Source Code | No | The paper mentions using 20 open-source projects as data sources (e.g., 'we curate 20 high-star C language projects from GitHub') and refers to 'BinMetric' as a benchmark, but it does not provide an explicit statement about releasing the source code for its own methodology or the BinMetric evaluation pipeline. |
| Open Datasets | Yes | BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. |
| Dataset Splits | No | The paper describes the composition and number of question items for each task within the BinMetric benchmark (e.g., 'we extract 1,000 question items', 'randomly sample 250 pairs', '70 assembly snippets are sampled', etc.), but it does not provide explicit training/test/validation splits for the evaluation conducted with the LLMs. |
| Hardware Specification | Yes | The experiments are conducted on an Ubuntu 22.04 server with 8 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions 'Open-source LLMs are downloaded from Huggingface' and 'half-precision in FP16 enabled for inference', but it does not specify concrete software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Given the context window limitations, we set max length to 8192 and max new tokens to 2048. Since accuracy is prioritized over diversity in most code-related tasks, the sampling temperature is set to 0.1, with top-k and top-p both set to 1 for deterministic responses. |
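The reported experiment setup can be sketched as a decoding configuration. This is a hypothetical illustration, not the authors' released pipeline; the function name `build_generation_config` and the dict keys are assumptions chosen to mirror common Hugging Face `generate()` parameter names.

```python
def build_generation_config():
    """Decoding parameters as reported in the BinMetric setup (sketch)."""
    return {
        "max_length": 8192,        # context window cap
        "max_new_tokens": 2048,    # generation budget per response
        "temperature": 0.1,        # low temperature: accuracy over diversity
        "top_k": 1,                # greedy-like token selection
        "top_p": 1.0,              # nucleus sampling effectively disabled
        "torch_dtype": "float16",  # half-precision (FP16) inference
    }

config = build_generation_config()
```

With Hugging Face transformers, the decoding keys would typically be passed to `model.generate(...)`, while the FP16 dtype is set when loading the model rather than at generation time.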