AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement
Authors: Pranjal Aggarwal, Bryan Parno, Sean Welleck
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show AlphaVerus enables Llama-70B to successfully generate verified solutions to 33% of HumanEval-Verified (The HumanEval-Verus Contributors, 2024), outperforming GPT-4o-based methods. Furthermore, through ablations, we establish the necessity of each component in AlphaVerus. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University. Correspondence to: Pranjal Aggarwal <EMAIL>, Bryan Parno <EMAIL>, Sean Welleck <EMAIL>. |
| Pseudocode | Yes | The complete algorithm is listed in Algorithm 1 and visualized in Figure 1. |
| Open Source Code | Yes | Code is available at https://alphaverus.github.io |
| Open Datasets | Yes | We use DafnyBench (Loughridge et al., 2025), a dataset of 562 programs of varying difficulty. ... We evaluate on verified versions of the MBPP and HumanEval datasets. In particular, MBPP-verified is sourced from (Yang et al., 2024; Misu et al., 2024) ... HumanEval-Verus is sourced from a concurrent open-source effort (The HumanEval-Verus Contributors, 2024) |
| Dataset Splits | No | The paper evaluates on verified versions of the MBPP and HumanEval benchmark datasets and uses collected data (DAFNY2VERUS-COLLECTION) as few-shot exemplars for models. However, it does not specify explicit training, validation, or test splits for these benchmarks; the collected data is used for few-shot prompting rather than for a conventional model training procedure that would require such splits. |
| Hardware Specification | Yes | We use L40S GPUs for inference. |
| Software Dependencies | No | We use SGLang for inference (Zheng et al., 2024). |
| Experiment Setup | Yes | We use consistent decoding parameters, with temperature set to 0.7, top-p set to 1.0 and max tokens set to 2048. For the translation step, we generate 256 examples per program in the translation phase. We set breadth and depth to 32 and 8 in the treefinement stage. α is set to 0.1 and β is set to 0.03 as defined in Equation 3. We set the rebase node sampling temperature to 0.1. We generate 32 samples for the comparison model and exploit model. We use the same setting in both inference and translation. For stochastic sampling as described in Equation 3.1, we randomly choose k/2 examples from the pool of k exemplars. All sampling is done with a batch size of 32. We do not tune hyperparameters and use the conventional settings throughout. |
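For readers attempting a reproduction, the decoding and search hyperparameters quoted in the Experiment Setup row can be gathered into one place. The sketch below is illustrative only: the dictionary keys and the helper function name are not taken from the paper's released code, and the exemplar-subsampling helper merely mirrors the described "choose k/2 of k exemplars" behavior.

```python
import random

# Hyperparameters as reported in the AlphaVerus experiment setup.
# Key names are illustrative, not identifiers from the paper's code.
CONFIG = {
    "temperature": 0.7,           # decoding temperature
    "top_p": 1.0,                 # nucleus sampling threshold
    "max_tokens": 2048,           # generation length cap
    "translation_samples": 256,   # candidates per program in translation
    "tree_breadth": 32,           # treefinement search breadth
    "tree_depth": 8,              # treefinement search depth
    "alpha": 0.1,                 # coefficient from the paper's Equation 3
    "beta": 0.03,                 # coefficient from the paper's Equation 3
    "rebase_temperature": 0.1,    # REBASE node-sampling temperature
    "comparison_samples": 32,     # samples for comparison and exploit models
    "batch_size": 32,             # sampling batch size
}

def stochastic_exemplar_subset(pool, rng=None):
    """Randomly choose k/2 exemplars from a pool of k, mirroring the
    stochastic few-shot sampling the paper describes (hypothetical helper)."""
    rng = rng or random
    return rng.sample(pool, len(pool) // 2)
```

The same settings are reported for both inference and translation, so a single shared configuration like this should suffice for a reproduction attempt.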