FormalAlign: Automated Alignment Evaluation for Autoformalization
Authors: Jianqiao Lu, Yingjia Wan, Yinya Huang, Jing Xiong, Zhengying Liu, Zhijiang Guo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated across four benchmarks augmented by our proposed misalignment strategies, FORMALALIGN demonstrates superior performance. In our experiments, FORMALALIGN outperforms GPT-4, achieving an Alignment-Selection Score 11.58% higher on FormL4-Basic (99.21% vs. 88.91%) and 3.19% higher on MiniF2F-Valid (66.39% vs. 64.34%). This effective alignment evaluation significantly reduces the need for manual verification. |
| Researcher Affiliation | Collaboration | Jianqiao Lu¹, Yingjia Wan², Yinya Huang⁴, Jing Xiong¹, Zhengying Liu³, Zhijiang Guo⁵ — ¹The University of Hong Kong; ²University of Cambridge; ³Huawei Noah's Ark Lab; ⁴ETH Zurich; ⁵Hong Kong University of Science and Technology (Guangzhou) |
| Pseudocode | No | The paper describes its method in Section 3 using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode blocks or algorithm figures. |
| Open Source Code | No | The paper refers to a third-party tool, the Lean 4 Compiler, stating, 'The Lean 4 Compiler in our scope is referred to as the tool available at https://github.com/leanprover-community/repl.' However, it does not explicitly state that the source code for the FORMALALIGN methodology described in this paper is publicly available or provide a link to its own code repository. |
| Open Datasets | Yes | We evaluate FORMALALIGN on four benchmarks sourced from MiniF2F (Zheng et al., 2022b) and FormL4 (Lu et al., 2024). In our experimental setup, we conduct fine-tuning on the FormL4 (Lu et al., 2024) and MMA (Jiang et al., 2023a) training sets, both of which are derived from Mathlib, a library of fundamental mathematical statements. |
| Dataset Splits | No | To thoroughly evaluate our method's ability to align informal mathematical statements with formal language, we employ a comprehensive set of test sets that covers both in-domain and out-of-domain data. Specifically, we use four distinct test sets: the basic and random test sets from FormL4, and the valid and test sets from MiniF2F (Zheng et al., 2022a). Our original model is fine-tuned on a combination of the FormL4 training set and the MMA training set from Mathlib. While specific test and training sets are mentioned, the paper does not provide explicit percentages or sample counts for training/validation/test splits within these datasets. |
| Hardware Specification | Yes | Our experiments are conducted in a computing environment with 8 NVIDIA A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | The paper mentions using specific models like Mistral-7B, GPT-4, and GPT-3.5 and the AdamW optimizer. However, it does not provide specific version numbers for underlying software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python) that would be necessary to fully replicate the experimental environment. |
| Experiment Setup | Yes | We employ the AdamW optimizer for model training over 1 epoch, with a batch size of 512. The learning rate is set at 2 × 10⁻⁶, incorporating a 3% learning rate warmup period. Table 7: Finetuning Hyperparameters — Global Batch Size: 128; LR: 5 × 10⁻⁶; Epochs: 1; Max Length: 2048; Weight Decay: 0; Warmup Ratio: 0.03. |
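The fine-tuning hyperparameters quoted in the Experiment Setup row (the paper's Table 7) can be captured in a small configuration sketch for anyone attempting a replication. The dictionary name, key names, and the total step count below are illustrative choices of ours, not from the paper; only the values come from the quoted table.

```python
# Hedged sketch of the reported FormalAlign fine-tuning configuration.
# Values are taken from Table 7 as quoted above; all identifiers here
# are hypothetical and not part of the paper's (unreleased) code.
finetune_config = {
    "global_batch_size": 128,
    "learning_rate": 5e-6,
    "epochs": 1,
    "max_length": 2048,
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
}

# The paper states a 3% warmup period; the number of warmup steps
# therefore depends on the total step count, which is not reported.
total_steps = 1000  # illustrative value only
warmup_steps = int(total_steps * finetune_config["warmup_ratio"])
print(warmup_steps)  # 30
```

Note that the running text (batch size 512, learning rate 2 × 10⁻⁶) and Table 7 (batch size 128, learning rate 5 × 10⁻⁶) disagree in the paper itself, so a replicator would need to choose one set of values.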