Automated Proof Generation for Rust Code via Self-Evolution

Authors: Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu Lahiri, Tao Xie, Lidong Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our experiments, SAFE leverages DeepSeek-Coder (Guo et al., 2024) as the generator, successfully synthesizing 19,017 formal specifications and 9,706 verified Rust functions... Our evaluation on a human-curated Verus benchmark with human-written specifications, VerusBench, and a synthetic benchmark, CodeNet-Test, demonstrates that SAFE empowers DeepSeek-Coder, which is initially unacquainted with Verus, to achieve 43.17% and 43.83% accuracy on the two benchmarks by direct generation, far surpassing GPT-4o's performance of 11.51% and 0.28%, respectively. Furthermore, the model's accuracy reaches 79.14% on VerusBench and 48.43% on CodeNet-Test once its self-debugging feature is used."
Researcher Affiliation | Collaboration | 1 Peking University, {tychen811,yh0315,taoxie}@pku.edu.cn; 2 Microsoft Research, EMAIL; 3 University of Illinois at Urbana-Champaign, EMAIL; 4 Columbia University, EMAIL; 5 University of California, Irvine, EMAIL
Pseudocode | Yes | "As listed in Algorithm 1, both procedures use GPT-4o to generate the round-0 data spec/proof-data0."
Open Source Code | No | The paper leverages existing open-source models (DeepSeek-Coder, Llama 3.1) and tools (Verus, Z3) but does not provide an explicit statement or link to the source code of the SAFE framework itself.
Open Datasets | Yes | "We employ the MBPP dataset (Austin et al., 2021) (only training split for data synthesis) and the CodeNet dataset (Puri et al., 2021) as our data sources... Our evaluation on a human-curated Verus benchmark with human-written specifications, VerusBench, and a synthetic benchmark, CodeNet-Test... 38 of them come from the dataset of SV-COMP-2021 (Beyer, 2021)... The remaining 78 tasks come from MBPP-DFY-153 (Misu et al., 2024)"
Dataset Splits | No | "We employ the MBPP dataset (Austin et al., 2021) (only training split for data synthesis) and the CodeNet dataset (Puri et al., 2021) as our data sources... To comprehensively measure the model's capability of generating proofs, we split a subset of CodeNet for testing, ensuring that it is not utilized in our data synthesis process." The paper mentions using training splits of existing datasets and splitting off a subset of CodeNet for testing, but does not give specific percentages or counts for how the training, validation, or test sets were created beyond these high-level descriptions.
Hardware Specification | No | The paper does not describe the specific hardware (GPU or CPU models, memory, etc.) used for the experiments. It quantifies computational effort as "one whole month of non-stop GPT-4o invocations" but does not name the underlying hardware.
Software Dependencies | No | "We employ the DeepSeek-Coder model (Guo et al., 2024) as the generator... We also employ the Llama3.1-8B-Instruct model (Dubey et al., 2024)... We use Verus to symbolically evaluate... Verus uses an SMT solver (e.g., Z3 (De Moura & Bjørner, 2008))." While the paper names the models and tools used, it does not provide version numbers for software dependencies such as Python, PyTorch, CUDA, Verus, or Z3, which would be needed for replication.
Experiment Setup | Yes | "For specification generation, we run two rounds; for proof generation, three. At each round of fine-tuning, we combine the verified programs and self-debugging data pairs together as training data, and train DeepSeek-Coder for 5 epochs, using a batch size of 128 and a learning rate of 1e-5."
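To make the notion of a "formal specification for a Rust function" above concrete, here is a minimal hypothetical sketch. Verus expresses contracts with `requires`/`ensures` clauses that are checked statically, but that syntax is not accepted by plain rustc, so this sketch approximates the contract with runtime assertions; the function and its contract are illustrative, not drawn from the paper's dataset.

```rust
/// Hypothetical example of a Rust function with a contract in the style
/// Verus verifies statically. The pre- and postconditions are checked at
/// runtime here, since Verus's `requires`/`ensures` syntax is not plain Rust.
fn max_index(v: &[i32]) -> usize {
    // Precondition (Verus would write: requires v.len() > 0)
    assert!(!v.is_empty());
    let mut best = 0;
    for i in 1..v.len() {
        if v[i] > v[best] {
            best = i;
        }
    }
    // Postcondition (Verus would write: ensures forall|i| v[i] <= v[best])
    assert!(v.iter().all(|&x| x <= v[best]));
    best
}

fn main() {
    // The element 7 at index 1 is the maximum.
    let idx = max_index(&[3, 7, 2]);
    println!("{idx}");
}
```

In Verus, the verifier discharges these conditions with an SMT solver (e.g., Z3) at compile time, so the runtime checks above would be unnecessary.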
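The self-evolution loop behind Algorithm 1 (bootstrap round-0 data with GPT-4o, then repeatedly generate candidates, verify them, and keep only verifier-passing samples as fine-tuning data) can be sketched as a toy loop. All names here are hypothetical stand-ins: the real pipeline invokes an LLM generator and the Verus verifier, which this sketch replaces with a deterministic toy filter.

```rust
/// Toy stand-in for one generated specification/proof candidate.
struct Sample {
    proof: String,
    verified: bool,
}

/// Stand-in for the generator: round 0 plays the role of GPT-4o
/// bootstrapping; later rounds would use the fine-tuned model.
fn generate(round: usize, n: usize) -> Vec<Sample> {
    (0..n)
        .map(|i| Sample {
            proof: format!("proof-r{round}-{i}"),
            // Toy stand-in for Verus: pretend every other sample verifies.
            verified: i % 2 == 0,
        })
        .collect()
}

/// Each round, keep only verifier-passing samples as training data;
/// the real pipeline fine-tunes the generator on them between rounds.
fn self_evolve(rounds: usize, per_round: usize) -> Vec<Sample> {
    let mut training_data = Vec::new();
    for round in 0..rounds {
        let candidates = generate(round, per_round);
        training_data.extend(candidates.into_iter().filter(|s| s.verified));
        // (Real pipeline: fine-tune the model on `training_data` here.)
    }
    training_data
}

fn main() {
    // The paper runs two such rounds for specification generation
    // and three for proof generation.
    let data = self_evolve(3, 4);
    println!("kept {} verified samples: {}", data.len(), data[0].proof);
}
```

The key design point this sketch captures is that the verifier acts as an automatic ground-truth filter, so the growing training set contains only verified programs.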
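The reported fine-tuning setup (5 epochs, batch size 128, learning rate 1e-5, two specification-generation rounds, three proof-generation rounds) can be collected into a small configuration struct; the struct and field names are illustrative, not taken from the paper's code.

```rust
/// Illustrative configuration mirroring the reported fine-tuning setup
/// (field names are hypothetical; the values come from the paper).
#[derive(Debug)]
struct FineTuneConfig {
    epochs: u32,
    batch_size: u32,
    learning_rate: f64,
    spec_rounds: u32,  // rounds of specification generation
    proof_rounds: u32, // rounds of proof generation
}

impl Default for FineTuneConfig {
    fn default() -> Self {
        FineTuneConfig {
            epochs: 5,
            batch_size: 128,
            learning_rate: 1e-5,
            spec_rounds: 2,
            proof_rounds: 3,
        }
    }
}

fn main() {
    let cfg = FineTuneConfig::default();
    println!("{} epochs, batch {}, lr {}", cfg.epochs, cfg.batch_size, cfg.learning_rate);
}
```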