Automated Proof Generation for Rust Code via Self-Evolution

Authors: Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu Lahiri, Tao Xie, Lidong Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In our experiments, SAFE leverages DeepSeek-Coder (Guo et al., 2024) as the generator, successfully synthesizing 19,017 formal specifications and 9,706 verified Rust functions... Our evaluation on a human-curated Verus benchmark with human-written specifications, VerusBench, and a synthetic benchmark, CodeNet-Test, demonstrates that SAFE empowers DeepSeek-Coder, which is initially unacquainted with Verus, to achieve 43.17% and 43.83% accuracy on the two benchmarks by direct generation, far surpassing GPT-4o's performance of 11.51% and 0.28%, respectively. Furthermore, the model's accuracy reaches 79.14% on VerusBench and 48.43% on CodeNet-Test once its self-debugging feature is used."
Researcher Affiliation | Collaboration | 1 Peking University, {tychen811,yh0315,taoxie}@pku.edu.cn; 2 Microsoft Research, EMAIL; 3 University of Illinois at Urbana-Champaign, EMAIL; 4 Columbia University, EMAIL; 5 University of California, Irvine, EMAIL
Pseudocode | Yes | "As listed in Algorithm 1, both procedures use GPT-4o to generate the round-0 data spec/proof-data0."
Open Source Code | No | The paper leverages existing open-source models (DeepSeek-Coder, Llama 3.1) and tools (Verus, Z3) but does not provide an explicit statement or link to the source code of the SAFE framework itself.
Open Datasets | Yes | "We employ the MBPP dataset (Austin et al., 2021) (only training split for data synthesis) and the CodeNet dataset (Puri et al., 2021) as our data sources... Our evaluation on a human-curated Verus benchmark with human-written specifications, VerusBench, and a synthetic benchmark, CodeNet-Test... 38 of them come from the dataset of SV-COMP-2021 (Beyer, 2021)... The remaining 78 tasks come from MBPP-DFY-153 (Misu et al., 2024)"
Dataset Splits | No | "We employ the MBPP dataset (Austin et al., 2021) (only training split for data synthesis) and the CodeNet dataset (Puri et al., 2021) as our data sources... To comprehensively measure the model's capability of generating proofs, we split a subset of CodeNet for testing, ensuring that it is not utilized in our data synthesis process." The paper mentions using training splits of existing datasets and splitting off a subset of CodeNet for testing, but does not give specific percentages or counts for how the training, validation, or test sets were created beyond these high-level descriptions.
Hardware Specification | No | The paper does not describe the specific hardware (GPU or CPU models, memory, etc.) used for the experiments. It quantifies computational effort as "one whole month of non-stop GPT-4o invocations" but does not name the underlying hardware.
Software Dependencies | No | "We employ the DeepSeek-Coder model (Guo et al., 2024) as the generator... We also employ the Llama3.1-8B-Instruct model (Dubey et al., 2024)... We use Verus to symbolically evaluate... Verus uses an SMT solver (e.g., Z3 (De Moura & Bjørner, 2008))." While the paper names the models and tools used, it does not provide version numbers for software dependencies such as Python, PyTorch, CUDA, Verus, or Z3, which would be needed for replication.
Experiment Setup | Yes | "For specification generation, we run two rounds; for proof generation, three. At each round of fine-tuning, we combine the verified programs and self-debugging data pairs together as training data, and train DeepSeek-Coder for 5 epochs, using a batch size of 128 and a learning rate of 1e-5."
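To make the notion of a "formal specification for a Rust function" above concrete, here is a minimal hypothetical sketch. Verus expresses contracts with `requires`/`ensures` clauses that are checked statically, but that syntax is not accepted by plain rustc, so this sketch approximates the contract with runtime assertions; the function and its contract are illustrative, not drawn from the paper's dataset.

```rust
/// Hypothetical example of a Rust function with a contract in the style
/// Verus verifies statically. The pre- and postconditions are checked at
/// runtime here, since Verus's `requires`/`ensures` syntax is not plain Rust.
fn max_index(v: &[i32]) -> usize {
    // Precondition (Verus would write: requires v.len() > 0)
    assert!(!v.is_empty());
    let mut best = 0;
    for i in 1..v.len() {
        if v[i] > v[best] {
            best = i;
        }
    }
    // Postcondition (Verus would write: ensures forall|i| v[i] <= v[best])
    assert!(v.iter().all(|&x| x <= v[best]));
    best
}

fn main() {
    // The element 7 at index 1 is the maximum.
    let idx = max_index(&[3, 7, 2]);
    println!("{idx}");
}
```

In Verus, the verifier discharges these conditions with an SMT solver (e.g., Z3) at compile time, so the runtime checks above would be unnecessary.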
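The self-evolution loop behind Algorithm 1 (bootstrap round-0 data with GPT-4o, then repeatedly generate candidates, verify them, and keep only verifier-passing samples as fine-tuning data) can be sketched as a toy loop. All names here are hypothetical stand-ins: the real pipeline invokes an LLM generator and the Verus verifier, which this sketch replaces with a deterministic toy filter.

```rust
/// Toy stand-in for one generated specification/proof candidate.
struct Sample {
    proof: String,
    verified: bool,
}

/// Stand-in for the generator: round 0 plays the role of GPT-4o
/// bootstrapping; later rounds would use the fine-tuned model.
fn generate(round: usize, n: usize) -> Vec<Sample> {
    (0..n)
        .map(|i| Sample {
            proof: format!("proof-r{round}-{i}"),
            // Toy stand-in for Verus: pretend every other sample verifies.
            verified: i % 2 == 0,
        })
        .collect()
}

/// Each round, keep only verifier-passing samples as training data;
/// the real pipeline fine-tunes the generator on them between rounds.
fn self_evolve(rounds: usize, per_round: usize) -> Vec<Sample> {
    let mut training_data = Vec::new();
    for round in 0..rounds {
        let candidates = generate(round, per_round);
        training_data.extend(candidates.into_iter().filter(|s| s.verified));
        // (Real pipeline: fine-tune the model on `training_data` here.)
    }
    training_data
}

fn main() {
    // The paper runs two such rounds for specification generation
    // and three for proof generation.
    let data = self_evolve(3, 4);
    println!("kept {} verified samples: {}", data.len(), data[0].proof);
}
```

The key design point this sketch captures is that the verifier acts as an automatic ground-truth filter, so the growing training set contains only verified programs.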
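The reported fine-tuning setup (5 epochs, batch size 128, learning rate 1e-5, two specification-generation rounds, three proof-generation rounds) can be collected into a small configuration struct; the struct and field names are illustrative, not taken from the paper's code.

```rust
/// Illustrative configuration mirroring the reported fine-tuning setup
/// (field names are hypothetical; the values come from the paper).
#[derive(Debug)]
struct FineTuneConfig {
    epochs: u32,
    batch_size: u32,
    learning_rate: f64,
    spec_rounds: u32,  // rounds of specification generation
    proof_rounds: u32, // rounds of proof generation
}

impl Default for FineTuneConfig {
    fn default() -> Self {
        FineTuneConfig {
            epochs: 5,
            batch_size: 128,
            learning_rate: 1e-5,
            spec_rounds: 2,
            proof_rounds: 3,
        }
    }
}

fn main() {
    let cfg = FineTuneConfig::default();
    println!("{} epochs, batch {}, lr {}", cfg.epochs, cfg.batch_size, cfg.learning_rate);
}
```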