BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models

Authors: Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. In summary, BinMetric makes a significant step forward in measuring the binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security.
Researcher Affiliation Collaboration (1) University of Science and Technology of China, Hefei, China; (2) Anhui Province Key Laboratory of Digital Security, Hefei, China; (3) QI-ANXIN Technology Research Institute, Beijing, China
Pseudocode No The paper uses the term 'decompiled pseudo code' to describe input data for several tasks (SR, BCS, AC), but it does not present its own algorithms or methods in structured pseudocode blocks.
Open Source Code No The paper mentions using 20 open-source projects as data sources (e.g., 'we curate 20 high-star C language projects from GitHub') and refers to 'BinMetric' as a benchmark, but it does not provide an explicit statement about releasing the source code for its own methodology or the BinMetric evaluation pipeline.
Open Datasets Yes BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations.
Dataset Splits No The paper describes the composition and number of question items for each task within the BinMetric benchmark (e.g., 'we extract 1,000 question items', 'randomly sample 250 pairs', '70 assembly snippets are sampled', etc.), but it does not provide explicit training/test/validation splits for the evaluation conducted with the LLMs.
Hardware Specification Yes The experiments are conducted on an Ubuntu 22.04 server with 8 NVIDIA RTX A6000 GPUs.
Software Dependencies No The paper mentions 'Open-source LLMs are downloaded from Huggingface' and 'half-precision in FP16 enabled for inference', but it does not specify concrete software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes Given the context window limitations, we set max length to 8192 and max new tokens to 2048. Since accuracy is prioritized over diversity in most code-related tasks, the sampling temperature is set to 0.1, with top-k and top-p both set to 1 for deterministic responses.
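The decoding settings quoted above can be collected into a single configuration, as a minimal sketch. The dict below mirrors the keyword arguments of Hugging Face's `generate()` API; since the paper's own evaluation pipeline is not released, the exact argument names it used are an assumption.

```python
# Hedged sketch of the decoding settings the paper reports:
# max length 8192, max new tokens 2048, temperature 0.1, top-k = top-p = 1.
GENERATION_CONFIG = {
    "max_length": 8192,      # context-window cap
    "max_new_tokens": 2048,  # per-response generation budget
    "do_sample": True,
    "temperature": 0.1,      # low temperature: accuracy prioritized over diversity
    "top_k": 1,              # top-k = 1 makes decoding effectively greedy
    "top_p": 1.0,
}
```

With an open-source model loaded in FP16 (as the paper describes), these settings would typically be passed as `model.generate(**inputs, **GENERATION_CONFIG)`.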