AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement
Authors: Pranjal Aggarwal, Bryan Parno, Sean Welleck
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show AlphaVerus enables Llama-70B to successfully generate verified solutions to 33% of HumanEval-Verified (The HumanEval-Verus Contributors, 2024), outperforming GPT-4o-based methods. Furthermore, through ablations, we establish the necessity of each component in AlphaVerus. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University. Correspondence to: Pranjal Aggarwal <EMAIL>, Bryan Parno <EMAIL>, Sean Welleck <EMAIL>. |
| Pseudocode | Yes | The complete algorithm is listed in Algorithm 1 and visualized in Figure 1. |
| Open Source Code | Yes | Code is available at https://alphaverus.github.io |
| Open Datasets | Yes | We use DafnyBench (Loughridge et al., 2025), a dataset of 562 programs of varying difficulty. ... We evaluate on verified versions of the MBPP and HumanEval datasets. In particular, MBPP-verified is sourced from (Yang et al., 2024; Misu et al., 2024) ... HumanEval-Verus is sourced from a concurrent open-source effort (The HumanEval-Verus Contributors, 2024) |
| Dataset Splits | No | The paper evaluates on verified versions of the MBPP and HumanEval benchmark datasets and uses collected data (DAFNY2VERUS-COLLECTION) as few-shot exemplars for models. However, it does not specify explicit training, validation, or test splits for these benchmarks; the collected data is used for few-shot prompting rather than for a conventional model training procedure that would require such splits. |
| Hardware Specification | Yes | We use L40S GPUs for inference. |
| Software Dependencies | No | We use SGLang for inference (Zheng et al., 2024). |
| Experiment Setup | Yes | We use consistent decoding parameters, with temperature set to 0.7, top-p set to 1.0 and max tokens set to 2048. For the translation step, we generate 256 examples per program in the translation phase. We set breadth and depth to 32 and 8 in the treefinement stage. α is set to 0.1 and β is set to 0.03 as defined in Equation 3. We set the rebase node sampling temperature to 0.1. We generate 32 samples for the comparison model and exploit model. We use the same setting in both inference and translation. For stochastic sampling as described in Equation 3.1, we randomly choose k/2 examples from the pool of k exemplars. All sampling is done with a batch size of 32. We do not tune hyperparameters and use the conventional settings throughout. |
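For readers attempting a reproduction, the decoding and search hyperparameters quoted in the Experiment Setup row can be gathered into one place. The sketch below is illustrative only: the dictionary keys and the helper function name are not taken from the paper's released code, and the exemplar-subsampling helper merely mirrors the described "choose k/2 of k exemplars" behavior.

```python
import random

# Hyperparameters as reported in the AlphaVerus experiment setup.
# Key names are illustrative, not identifiers from the paper's code.
CONFIG = {
    "temperature": 0.7,           # decoding temperature
    "top_p": 1.0,                 # nucleus sampling threshold
    "max_tokens": 2048,           # generation length cap
    "translation_samples": 256,   # candidates per program in translation
    "tree_breadth": 32,           # treefinement search breadth
    "tree_depth": 8,              # treefinement search depth
    "alpha": 0.1,                 # coefficient from the paper's Equation 3
    "beta": 0.03,                 # coefficient from the paper's Equation 3
    "rebase_temperature": 0.1,    # REBASE node-sampling temperature
    "comparison_samples": 32,     # samples for comparison and exploit models
    "batch_size": 32,             # sampling batch size
}

def stochastic_exemplar_subset(pool, rng=None):
    """Randomly choose k/2 exemplars from a pool of k, mirroring the
    stochastic few-shot sampling the paper describes (hypothetical helper)."""
    rng = rng or random
    return rng.sample(pool, len(pool) // 2)
```

The same settings are reported for both inference and translation, so a single shared configuration like this should suffice for a reproduction attempt.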