Pre-Training Representations of Binary Code Using Contrastive Learning

Authors: Yifan Zhang, Chen Huang, Yueke Zhang, Huajie Shao, Kevin Leach, Yu Huang

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the effectiveness of ContraBin through four indicative downstream tasks related to binary code: algorithmic functionality classification, function name recovery, code summarization, and reverse engineering. The results show that ContraBin considerably improves performance on all four tasks, measured by accuracy, mean average precision, and BLEU scores as appropriate. Our extensive evaluation on these four diverse tasks shows that ContraBin consistently and significantly outperforms strong pre-trained baselines, demonstrating the broad applicability of its embeddings.
Researcher Affiliation Academia Yifan Zhang (EMAIL), Department of Computer Science, Vanderbilt University; Chen Huang (EMAIL), Department of Computer Science, National University of Singapore; Yueke Zhang (EMAIL), Department of Computer Science, Vanderbilt University; Huajie Shao (EMAIL), Department of Computer Science, College of William & Mary; Kevin Leach (EMAIL), Department of Computer Science, Vanderbilt University; Yu Huang (EMAIL), Department of Computer Science, Vanderbilt University
Pseudocode Yes Algorithm 1: ContraBin pre-training framework
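The framework's core is contrastive pre-training over paired representations of the same program. As a minimal, dependency-free sketch of the InfoNCE-style objective such frameworks typically optimize (the function names and the temperature value here are illustrative, not taken from the ContraBin code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss: each anchor's positive is the same-index row;
    all other rows in the batch serve as in-batch negatives."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        # numerically stable log-softmax denominator
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(-(logits[i] - log_denom))
    return sum(losses) / len(losses)
```

Correctly matched pairs (e.g., a binary function and its own source) should yield a much lower loss than mismatched pairs, which is what drives the embeddings together during pre-training.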
Open Source Code Yes To facilitate further research and ensure the reproducibility of our results, we have made our complete implementation, pre-trained models, and all datasets publicly available on Zenodo at https://zenodo.org/records/15219264.
Open Datasets Yes To facilitate further research and ensure the reproducibility of our results, we have made our complete implementation, pre-trained models, and all datasets publicly available at https://zenodo.org/records/15219264. The repository includes preprocessed datasets used for training and evaluation. For the binary functional algorithm classification task (RQ1: Downstream Task 1), we adopt POJ-104 from the CodeXGLUE dataset (Lu et al., 2021). For the binary function name recovery task (RQ2), we chose the DIRE dataset used to train the DIRTY model (Lacomis et al., 2019). For the binary code summarization and reverse engineering tasks, we use the AnghaBench test set (Da Silva et al., 2021) during pre-training.
Dataset Splits Yes POJ-104: For the binary functional algorithm classification task (RQ1: Downstream Task 1), we adopt POJ-104 from the CodeXGLUE dataset (Lu et al., 2021). [...] The dataset is categorized into train/development/test sets with non-overlapping program problem labels. [...] Table 2 (Functionality Classification): Train 64 problems / 14,614 examples; Dev 16 problems / 5,079 examples; Test 24 problems / 8,102 examples. DIRE: For the binary function name recovery task (RQ2), we chose the DIRE dataset used to train the DIRTY model (Lacomis et al., 2019). [...] Table 2 (Function Name Recovery): Train 91 names / 49,933 examples; Dev 91 names / 2,774 examples; Test 91 names / 2,775 examples. AnghaBench: For the binary code summarization and reverse engineering tasks, we use the AnghaBench test set (Da Silva et al., 2021) during pre-training. [...] Table 2 (Code Summarization and Reverse Engineering): Train 16,383 examples; Dev 910 examples; Test 911 examples.
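The POJ-104 split keeps program problem labels disjoint across train/dev/test, which prevents leakage of a problem's solutions between splits. A hedged sketch of how such a label-wise partition can be produced (the helper name and seed handling are ours; the 64/16/24 counts mirror Table 2):

```python
import random

def split_by_label(examples, n_train=64, n_dev=16, n_test=24, seed=123456):
    """Partition examples so no problem label appears in more than one split."""
    labels = sorted({ex["problem"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(labels)
    train_l = set(labels[:n_train])
    dev_l = set(labels[n_train:n_train + n_dev])
    test_l = set(labels[n_train + n_dev:n_train + n_dev + n_test])
    pick = lambda chosen: [ex for ex in examples if ex["problem"] in chosen]
    return pick(train_l), pick(dev_l), pick(test_l)
```

Splitting on labels rather than on individual examples is what makes the classification evaluation test generalization to unseen problems.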
Hardware Specification Yes We use two Nvidia A40 GPUs during model pre-training and follow the parameter settings of SimpleCLIP (Shariatnia, 2021). For this task, we utilized 8 Nvidia A100 GPUs for ContraBin (CodeT5) model pre-training, with the random seed set to 42 to ensure reproducibility.
Software Dependencies No After obtaining the source code from AnghaBench, we compile the code snippets in AnghaBench using Clang (specifically, LLVM5) to generate the corresponding assembly code. We adopt an encoder-decoder CodeT5 (Wang et al., 2021) model to automatically generate a single comment for each snippet of source code in our dataset. The paper mentions software tools such as Clang and LLVM, and models such as CodeT5, but it does not provide specific version numbers for these or other software libraries/frameworks, which are essential for a reproducible description of software dependencies.
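The pipeline lowers each AnghaBench C snippet to textual assembly with Clang. Since neither the Clang version nor the flags are pinned in the paper, a hedged sketch of a typical invocation (the helper name and the `-O0` default are assumptions, not confirmed details):

```python
import subprocess

def compile_to_assembly(c_path, asm_path, clang="clang", opt="-O0"):
    """Build (and optionally run) the Clang command that lowers a C file
    to textual assembly with `-S`."""
    cmd = [clang, "-S", opt, c_path, "-o", asm_path]
    # subprocess.run(cmd, check=True)  # uncomment when clang is on PATH
    return cmd
```

Pinning the exact compiler version and optimization level matters here: different Clang releases and `-O` settings emit different assembly for the same source, which directly changes the pre-training corpus.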
Experiment Setup Yes We use two Nvidia A40 GPUs during model pre-training and follow the parameter settings of SimpleCLIP (Shariatnia, 2021). We set the random seed to 42 in pre-training for reproducibility. To better evaluate performance and robustness, we also train two versions of our model, ContraBin-PCL and ContraBin. For ContraBin-PCL, we set 10 epochs for primary contrastive learning only. For ContraBin, we use 10, 10, and 10 epochs to improve overall model efficiency on binary code analysis, and 10, 30, and 30 epochs to enhance general model performance on binary code comprehension. For the binary functional algorithm classification task, we fine-tune each model for 2 epochs with a block size of 400, training and validation batch sizes of 32 and 8, a learning rate of 2e-5, and maximal gradient normalization of 1. For the binary function name recovery task, we fine-tune each model for 5 epochs with a block size of 256, training and validation batch sizes of 8 and 16, a learning rate of 2e-5, and maximal gradient normalization of 1. In all fine-tuning processes, we use the default random seed of 123456. For the binary code summarization and reverse engineering tasks, we set the input length for both tasks to 512, and the output length for summarization and translation to 32 and 512, respectively. We use training and evaluation batch sizes of 16 and a beam size of 5. We set the number of epochs for the summarization task to 5 and the number of batches for the reverse engineering task to 20,000. For reproducibility, the random seed is set to 42 for both tasks. For this task, we utilized 8 Nvidia A100 GPUs for ContraBin (CodeT5) model pre-training, with the random seed set to 42 to ensure reproducibility.
Since the focus was exclusively on binary code analysis rather than comprehension, we adjusted our training strategy accordingly: we reduced the training duration to 40% of the previous rounds to concentrate the model's efforts on this specific task.
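Both pre-training (seed 42) and fine-tuning (seed 123456) pin random seeds. A minimal seeding helper in the style commonly used for such setups; the NumPy/PyTorch calls are guarded so the sketch also runs where those packages are absent (the helper name is ours, not from the released code):

```python
import random

def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when available,
    so repeated runs draw the same random numbers."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

seed_everything(42)       # pre-training seed reported in the paper
# seed_everything(123456) # default fine-tuning seed reported in the paper
```

Note that seeding alone does not guarantee bit-identical GPU results; nondeterministic CUDA kernels can still introduce run-to-run variation, which is one reason the exact hardware and library versions matter for reproduction.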