Automated Proof Generation for Rust Code via Self-Evolution
Authors: Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu Lahiri, Tao Xie, Lidong Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, SAFE leverages DeepSeek-Coder (Guo et al., 2024) as the generator, successfully synthesizing 19,017 formal specifications and 9,706 verified Rust functions...Our evaluation on a human-curated Verus benchmark with human-written specifications, VerusBench, and a synthetic benchmark, CodeNet-Test, demonstrates that SAFE empowers DeepSeek-Coder, which is initially unacquainted with Verus, to achieve 43.17% and 43.83% accuracy on the two benchmarks by direct generation, far surpassing GPT-4o's performance of 11.51% and 0.28%, respectively. Furthermore, the model's accuracy reaches 79.14% on VerusBench and 48.43% on CodeNet-Test once its self-debugging feature is used. |
| Researcher Affiliation | Collaboration | 1 Peking University, {tychen811,yh0315,taoxie}@pku.edu.cn 2 Microsoft Research, EMAIL 3 University of Illinois at Urbana-Champaign, EMAIL 4 Columbia University, EMAIL 5 University of California Irvine, EMAIL |
| Pseudocode | Yes | As listed in Algorithm 1, both procedures use GPT-4o to generate the round-0 data spec/proof-data0. |
| Open Source Code | No | The paper leverages existing open-source models (DeepSeek-Coder, Llama 3.1) and tools (Verus, Z3) but does not provide an explicit statement or link for the source code of the SAFE framework itself. |
| Open Datasets | Yes | We employ the MBPP dataset (Austin et al., 2021) (only training split for data synthesis) and the CodeNet dataset (Puri et al., 2021) as our data sources...Our evaluation on a human-curated Verus benchmark with human-written specifications, VerusBench, and a synthetic benchmark, CodeNet-Test...38 of them come from the dataset of SV-COMP-2021 (Beyer, 2021)...The remaining 78 tasks come from MBPP-DFY-153 (Misu et al., 2024) |
| Dataset Splits | No | We employ the MBPP dataset (Austin et al., 2021) (only training split for data synthesis) and the CodeNet dataset (Puri et al., 2021) as our data sources...To comprehensively measure the model capability of generating proofs, we split a subset of CodeNet for testing, ensuring that it is not utilized in our data synthesis process. The paper mentions using training splits of existing datasets and splitting a subset of CodeNet for testing, but does not provide specific percentages or counts for how the models' training, validation, or test sets were created beyond these high-level descriptions. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware (GPU, CPU models, memory, etc.) used for running the experiments. It mentions computational effort in terms of 'one whole month of non-stop GPT-4o invocations' but not the underlying hardware. |
| Software Dependencies | No | We employ the DeepSeek-Coder model (Guo et al., 2024) as the generator...We also employ the Llama3.1-8B-Instruct model (Dubey et al., 2024)...We use Verus to symbolically evaluate...Verus uses an SMT solver (e.g., Z3 (De Moura & Bjørner, 2008)). While the paper mentions the models and tools used, it does not provide specific version numbers for software dependencies such as Python, PyTorch, CUDA, or specific versions of Verus or Z3, which would be needed for replication. |
| Experiment Setup | Yes | For specification generation, we run two rounds; for proof generation, three. At each round of fine-tuning, we combine the verified programs and self-debugging data pairs together as training data, and train DeepSeek-Coder for 5 epochs, using a batch size of 128 and a learning rate of 1 × 10⁻⁵. |
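The reported experiment setup can be summarized as a small configuration sketch. This is an illustrative assumption, not code from the paper: the names `FineTuneConfig` and `combine_round_data` are hypothetical, and only the hyperparameter values (2 specification rounds, 3 proof rounds, 5 epochs, batch size 128, learning rate 1e-5) come from the quoted text.

```python
from dataclasses import dataclass

@dataclass
class FineTuneConfig:
    """Hypothetical container for the hyperparameters quoted in the row above."""
    spec_rounds: int = 2        # rounds of specification generation (reported)
    proof_rounds: int = 3       # rounds of proof generation (reported)
    epochs: int = 5             # fine-tuning epochs per round (reported)
    batch_size: int = 128       # reported batch size
    learning_rate: float = 1e-5 # reported learning rate, 1 x 10^-5

def combine_round_data(verified_programs, self_debug_pairs):
    """Per the setup description, each round's training data merges the
    verified programs with the self-debugging data pairs."""
    return list(verified_programs) + list(self_debug_pairs)

cfg = FineTuneConfig()
# Toy illustration: one verified function plus one (wrong proof, fixed proof) pair.
train_data = combine_round_data(["verified_fn"], [("failed_proof", "fixed_proof")])
print(len(train_data), cfg.learning_rate)
```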