ProSec: Fortifying Code LLMs with Proactive Security Alignment
Authors: Xiangzhe Xu, Zian Su, Jinyao Guo, Kaiyuan Zhang, Zhenting Wang, Xiangyu Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that models trained with PROSEC are 25.2% to 35.4% more secure compared to previous work without degrading models' utility. ... We demonstrate the effectiveness of PROSEC on the Purple Llama (Bhatt et al., 2023) secure coding benchmark. The models trained with the dataset synthesized by PROSEC are 25.2%–35.4% more secure than those trained with the SafeCoder dataset. We further validate that PROSEC does not harm the utility of code LLMs. We conduct thorough ablation studies to justify the design decisions in PROSEC. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Purdue University, IN, USA 2Department of Computer Science, Rutgers University, NJ, USA. Correspondence to: Xiangzhe Xu <EMAIL>, Zian Su <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Vulnerability-inducing instruction generation |
| Open Source Code | Yes | We publish a dataset of synthesized vulnerability-inducing instructions that can effectively expose the weakness of code LLMs. PROSEC and the dataset are available at https://github.com/PurCL/ProSec. |
| Open Datasets | Yes | Seed Instruction-Tuning Dataset: We use the code-related part of Infinity-Instruct (BAAI, 2024) as our seed instruction dataset for data synthesis. ... Test Dataset: We use Purple Llama (Bhatt et al., 2023) as the test dataset for code model safety. ... We use the multi-lingual version of HumanEval (Chen et al., 2021; Guo et al., 2024a) and the multi-lingual version of MBPP (Austin et al., 2021) (denoted as MXEval (Athiwaratkun et al., 2022)) as the test dataset for utility. ... We publish a dataset of synthesized vulnerability-inducing instructions that can effectively expose the weakness of code LLMs. PROSEC and the dataset are available at https://github.com/PurCL/ProSec. |
| Dataset Splits | No | The paper describes how the preference dataset is constructed and selected, but it does not specify explicit training, validation, or test splits for this dataset that would be needed to reproduce their experimental results. It mentions using subsets of existing benchmarks for evaluation but not the specific splits of their own generated data. |
| Hardware Specification | Yes | We run the training of PROSEC on 2 NVIDIA A100-40G GPUs. |
| Software Dependencies | No | The paper mentions several preference optimization methods (DPO, IPO, ORPO, SimPO) and LoRA for parameter-efficient training, along with their hyperparameters. However, it does not provide specific version numbers for these libraries or other underlying software such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The major preference optimization-related hyperparameters in our experiments are shown in Table 6. For training, we set the total batch size to 64. We adopt LoRA (Hu et al., 2021) for parameter-efficient training of the target model. The rank r = 8 and α = 16 for all our experiments. We run the training of PROSEC on 2 NVIDIA A100-40G GPUs. ... Warm-up Training for Influence Score: We train each target model on D_sec for 1k steps and leverage checkpoints of every 100 steps to compute the training dynamics for D_norm data influence score computation. |
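Since the paper reports using DPO-style preference optimization (among DPO, IPA, ORPO, and SimPO, per the Software Dependencies row) but does not specify library versions, the rows above leave the exact training code unreproducible. As a hedged illustration only, the sketch below shows the standard per-example DPO objective in pure Python; the β value and log-probability inputs are placeholders, not values taken from the paper (its β appears in Table 6, which is not quoted here).

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the sequence log-probability of the chosen/rejected
    response under the trained policy or the frozen reference model.
    beta=0.1 is a common default here, not a value from the paper.
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x)); log1p keeps this numerically stable.
    return math.log1p(math.exp(-logits))

# When policy and reference agree (zero margin difference), the loss is log 2.
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# When the policy strongly prefers the chosen response relative to the
# reference, the loss approaches zero.
confident = dpo_loss(0.0, -100.0, 0.0, 0.0)
```

Note that actual training would compute these log-probabilities with a framework such as PyTorch and backpropagate through the policy terms; this fragment only makes the objective in the quoted setup concrete.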