FormalAlign: Automated Alignment Evaluation for Autoformalization
Authors: Jianqiao Lu, Yingjia Wan, Yinya Huang, Jing Xiong, Zhengying Liu, Zhijiang Guo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated across four benchmarks augmented by our proposed misalignment strategies, FORMALALIGN demonstrates superior performance. In our experiments, FORMALALIGN outperforms GPT-4, achieving an Alignment-Selection Score 11.58% higher on FormL4-Basic (99.21% vs. 88.91%) and 3.19% higher on MiniF2F-Valid (66.39% vs. 64.34%). This effective alignment evaluation significantly reduces the need for manual verification. |
| Researcher Affiliation | Collaboration | Jianqiao Lu¹, Yingjia Wan², Yinya Huang⁴, Jing Xiong¹, Zhengying Liu³, Zhijiang Guo⁵ — ¹The University of Hong Kong; ²University of Cambridge; ³Huawei Noah's Ark Lab; ⁴ETH Zurich; ⁵Hong Kong University of Science and Technology (Guangzhou) |
| Pseudocode | No | The paper describes its method in Section 3 using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode blocks or algorithm figures. |
| Open Source Code | No | The paper refers to a third-party tool, the Lean 4 Compiler, stating, 'The Lean 4 Compiler in our scope is referred to as the tool available at https://github.com/leanprover-community/repl.' However, it does not explicitly state that the source code for the FORMALALIGN methodology described in this paper is publicly available or provide a link to its own code repository. |
| Open Datasets | Yes | We evaluate FORMALALIGN on four benchmarks sourced from MiniF2F (Zheng et al., 2022b) and FormL4 (Lu et al., 2024). In our experimental setup, we conduct fine-tuning on the FormL4 (Lu et al., 2024) and MMA (Jiang et al., 2023a) training sets, both of which are derived from Mathlib, a library of fundamental mathematical statements. |
| Dataset Splits | No | To thoroughly evaluate our method's ability to align informal mathematical statements with formal language, we employ a comprehensive set of test sets that covers both in-domain and out-of-domain data. Specifically, we use four distinct test sets: the basic and random test sets from FormL4, and the valid and test sets from MiniF2F (Zheng et al., 2022a). Our original model is fine-tuned on a combination of the FormL4 training set and the MMA training set from Mathlib. While specific test and training sets are mentioned, the paper does not provide explicit percentages or sample counts for training/validation/test splits within these datasets. |
| Hardware Specification | Yes | Our experiments are conducted in a computing environment with 8 NVIDIA A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | The paper mentions using specific models like Mistral-7B, GPT-4, and GPT-3.5 and the AdamW optimizer. However, it does not provide specific version numbers for underlying software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python) that would be necessary to fully replicate the experimental environment. |
| Experiment Setup | Yes | We employ the AdamW optimizer for model training over 1 epoch, with a batch size of 512. The learning rate is set at 2 × 10⁻⁶, incorporating a 3% learning rate warmup period. Table 7: Finetuning Hyperparameters — Global Batch Size: 128; LR: 5 × 10⁻⁶; Epochs: 1; Max Length: 2048; Weight Decay: 0; Warmup Ratio: 0.03. |
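The fine-tuning hyperparameters quoted in the Experiment Setup row (the paper's Table 7) can be captured in a small configuration sketch for anyone attempting a replication. The dictionary name, key names, and the total step count below are illustrative choices of ours, not from the paper; only the values come from the quoted table.

```python
# Hedged sketch of the reported FormalAlign fine-tuning configuration.
# Values are taken from Table 7 as quoted above; all identifiers here
# are hypothetical and not part of the paper's (unreleased) code.
finetune_config = {
    "global_batch_size": 128,
    "learning_rate": 5e-6,
    "epochs": 1,
    "max_length": 2048,
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
}

# The paper states a 3% warmup period; the number of warmup steps
# therefore depends on the total step count, which is not reported.
total_steps = 1000  # illustrative value only
warmup_steps = int(total_steps * finetune_config["warmup_ratio"])
print(warmup_steps)  # 30
```

Note that the running text (batch size 512, learning rate 2 × 10⁻⁶) and Table 7 (batch size 128, learning rate 5 × 10⁻⁶) disagree in the paper itself, so a replicator would need to choose one set of values.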