PDE-Controller: LLMs for Autoformalization and Reasoning of PDEs

Authors: Mauricio Soroco, Jialin Song, Mengzhou Xia, Kye Emond, Weiran Sun, Wuyang Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our PDE-Controller significantly outperforms prompting the latest open-source and GPT models in reasoning, autoformalization, and program synthesis, achieving up to a 62% improvement in utility gain for PDE control. We release all data, model checkpoints, and code at https://pde-controller.github.io/. ... 4. Experiments: We study 1D heat and wave problems as pioneering showcases. All our models are fine-tuned from MathCoder2, and we compare against few-shot evaluations of MathCoder2, GPT-4o, and o1-mini (Achiam et al., 2023). Please read Appendix E for model and training details. 4.1. Accurate Autoformalization and Program Synthesis: We first evaluate the performance of our Translator for autoformalization and Coder for program synthesis. ... Results. The autoformalization can be evaluated using the intersection over union (IoU) between the predicted and target STLs (constraints). The code generation should aim for high executability and low utility RMSE simultaneously. As in Table 4, our Translator and Coder achieve the best results across all metrics with low deviations, indicating strong and reliable autoformalization and program synthesis.
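The two metrics quoted above (IoU over predicted vs. target STL constraints, and RMSE over utilities) can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the function names and the set-based treatment of constraints are assumptions.

```python
import math

def stl_iou(predicted, target):
    """Intersection over union of two collections of STL constraints,
    treated here as sets of hashable constraint representations."""
    pred, tgt = set(predicted), set(target)
    if not pred and not tgt:
        return 1.0  # two empty specifications agree trivially
    return len(pred & tgt) / len(pred | tgt)

def utility_rmse(predicted_utils, target_utils):
    """Root-mean-square error between achieved and reference utilities."""
    n = len(predicted_utils)
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted_utils, target_utils)) / n
    )
```

For example, a prediction sharing one of three distinct constraints with the target scores an IoU of 1/3.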
Researcher Affiliation | Academia | 1 School of Computing Science, Simon Fraser University; 2 Department of Computer Science, Princeton University; 3 Department of Mathematics, Simon Fraser University; 4 Department of Physics, Simon Fraser University. Correspondence to: Wuyang Chen <EMAIL>.
Pseudocode | No | The paper describes the framework and reasoning steps in numbered lists within the text (e.g., in Section 3.1 "Framework Overview" and Section 3.4.1 "What is Reasoning for PDE Control?"), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We release all data, model checkpoints, and code at https://pde-controller.github.io/.
Open Datasets | Yes | We release all data, model checkpoints, and code at https://pde-controller.github.io/. We build the first comprehensive datasets for PDE control designed for LLMs, including over 2 million samples of natural and formal language, code programs, as well as PDE control annotations. We also collect manually written samples by human volunteers to evaluate LLMs in real-world scenarios. Our novel dataset will serve as a high-quality testbed for future research in AI for PDE reasoning. The Croissant metadata URL can be found here: https://huggingface.co/datasets/delta-lab-ai/pde-controller/tree/main
Dataset Splits | Yes | Table 2 (our dataset for autoformalization and program synthesis), counts by number of constraints:

Num. Constraints    1       2       3        Total
Num. STLs           6       72      1296     1374
Heat (Train)        3840    45792   817776   867408
Heat (Test)         960     11448   204768   217176
Wave (Train)        3840    45504   795744   845088
Wave (Test)         960     11304   196992   209256

... We merge the training set for both heat and wave problems for the training of Translator and Coder. ... Table 8 (overview of our reasoning data; questions are thresholded into 3 difficulty levels by the success rate P of random sampling):

Heat: Num. (φ(w), φ(l)) pairs — Training 4813, Testing 1181, Total 5994.
  Easy, P ∈ (0.8, 1):    27.1% / 26.1% / 26.9% (Training / Testing / Total)
  Medium, P ∈ (0.5, 0.8]: 37.3% / 37.8% / 37.4%
  Hard, P ∈ [0, 0.5]:     35.6% / 36.2% / 35.7%
Wave: Num. (φ(w), φ(l)) pairs — Training 3812, Testing 966, Total 4778.
  Easy, P ∈ (0.88, 1):    32.5% / 33.6% / 32.7%
  Medium, P ∈ (0.55, 0.88]: 33.1% / 32.5% / 33.0%
  Hard, P ∈ [0, 0.55]:    34.4% / 33.9% / 34.3%
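The difficulty thresholding in Table 8 amounts to bucketing each problem by its random-sampling success rate P. A minimal sketch for the heat thresholds (0.8 and 0.5); the function name is illustrative, not from the paper:

```python
def heat_difficulty(p):
    """Bucket a heat problem by the success rate P of random sampling,
    using the Table 8 heat thresholds: Easy P in (0.8, 1),
    Medium P in (0.5, 0.8], Hard P in [0, 0.5]."""
    if p > 0.8:
        return "Easy"
    if p > 0.5:
        return "Medium"
    return "Hard"
```

The wave split uses the same scheme with thresholds 0.88 and 0.55 instead.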
Hardware Specification | Yes | E. Training Details: We leverage the pretrained MathCoder2-DeepSeekMath-7B (Lu et al., 2024) checkpoint (MathCoder2), which has a 4096-token context length. ... E.1. Autoformalization: SFT of Translator. The Translator was trained with two 6000 Ada GPUs using a per-GPU batch size of 16 and 4 gradient accumulation steps for a total of 3000 steps. ... E.2. Program Synthesis: SFT of Coder. The Coder further fine-tuned the Translator with supervised fine-tuning and LoRA, rank r = 64 and α = 256, to produce Python code from natural language and STL pairs. This was trained with two 6000 Ada GPUs using a per-GPU batch size of 8 and 8 gradient accumulation steps for a total of 3000 steps. ... E.3. Reasoning: RLHF of Controller. The Controller is trained with DPO (Rafailov et al., 2024) from the Translator checkpoint with LoRA rank r = 64 and α = 256. We train with two 6000 Ada GPUs using a per-GPU batch size of 2 and 4 gradient accumulation steps for a total of 16,800 steps. For DPO, we set β = 0.1 and λ = 1 in Eq. 3.
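For reference, the standard DPO objective (Rafailov et al., 2024) that the β = 0.1 above parameterizes can be written as a per-pair loss on log-probabilities under the policy and a frozen reference model. This is a generic sketch of vanilla DPO, not the paper's Eq. 3, which additionally involves a weight λ not reproduced here.

```python
import math

def dpo_loss(beta, logp_w, ref_logp_w, logp_l, ref_logp_l):
    """Standard DPO loss for one (winner, loser) preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    logp_* are policy log-probs; ref_logp_* are reference-model log-probs."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference on both responses the margin is zero and the loss is log 2; widening the winner-loser margin drives the loss toward zero.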
Software Dependencies | No | The paper mentions using MathCoder2-DeepSeekMath-7B as a base model and the Gurobi optimizer, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | E. Training Details ... E.1. Autoformalization: SFT of Translator. The Translator was trained with two 6000 Ada GPUs using a per-GPU batch size of 16 and 4 gradient accumulation steps for a total of 3000 steps. We fine-tuned the MathCoder2-DeepSeekMath-7B model from (Lu et al., 2024) with LoRA rank r = 64 and α = 256. ... E.2. Program Synthesis: SFT of Coder ... This was trained with two 6000 Ada GPUs using a per-GPU batch size of 8 and 8 gradient accumulation steps for a total of 3000 steps. ... E.3. Reasoning: RLHF of Controller ... We train with two 6000 Ada GPUs using a per-GPU batch size of 2 and 4 gradient accumulation steps for a total of 16,800 steps. For DPO, we set β = 0.1 and λ = 1 in Eq. 3.
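The per-update sample counts implied by these settings follow from a one-line calculation (effective batch = per-GPU batch × number of GPUs × gradient accumulation steps). The helper below is illustrative, not from the paper: it gives 128 for both the Translator (16 × 2 × 4) and the Coder (8 × 2 × 8), and 16 for the Controller (2 × 2 × 4).

```python
def effective_batch_size(per_gpu_batch, num_gpus, grad_accum_steps):
    """Samples consumed per optimizer update under data-parallel
    training with gradient accumulation."""
    return per_gpu_batch * num_gpus * grad_accum_steps
```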