Simulate, Refine and Integrate: Strategy Synthesis for Efficient SMT Solving

Authors: Bingzhe Zhou, Hannan Wang, Yuan Yao, Taolue Chen, Feng Xu, Xiaoxing Ma

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations show the superior performance of SIRISMT over the baseline methods. For example, compared to the default Z3, it solves 26.8% more formulas and achieves up to an 86.3% improvement in the Par-2 score on benchmark datasets. Additionally, we show that the synthesized strategy can improve the code coverage by up to 11.8% in a downstream symbolic execution benchmark.
Researcher Affiliation | Academia | (1) State Key Laboratory of Novel Software Technology, Nanjing University, China; (2) School of Computing and Mathematical Sciences, Birkbeck, University of London, UK.
Pseudocode | No | The paper describes methods and processes in detail, but it does not include a distinct section or figure explicitly labeled as "Pseudocode" or "Algorithm" with structured, code-like steps.
Open Source Code | Yes | The source code, dataset and the extended version are available at: https://github.com/SoftWiser-group/SiriSMT
Open Datasets | Yes | The source code, dataset and the extended version are available at: https://github.com/SoftWiser-group/SiriSMT ... Datasets: We evaluate the effectiveness of SIRISMT on five SMT benchmark datasets, namely core [Barrett et al., 2016b], Sage2 [Barrett et al., 2016e], leipzig [Barrett et al., 2016d], hycomp [Barrett et al., 2016c], and AProVE [Barrett et al., 2016a], across the three theories QF_BV (Quantifier-Free Bit-Vectors), QF_NIA (Quantifier-Free Nonlinear Integer Arithmetic), and QF_NRA (Quantifier-Free Nonlinear Real Arithmetic). ... We use coreutils (version 8.31) and two real-world programs (Make and Gawk) as datasets, which are standard benchmarks for evaluating the performance of symbolic execution techniques [Cadar et al., 2008].
Dataset Splits | Yes | In particular, two experiments were conducted: 1) 51 randomly selected coreutils programs as the training set, with the remaining 52 used for testing; 2) strategies trained on the Sage2 dataset were tested on all coreutils programs.
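The first split can be reproduced schematically; the program names, counts layout, and seed below are placeholders, not the paper's actual coreutils list:

```python
import random

# Hypothetical stand-ins for the 103 coreutils programs (names invented).
programs = [f"prog_{i:03d}" for i in range(103)]

# Fixed seed so the split is reproducible across runs.
rng = random.Random(0)
shuffled = programs[:]
rng.shuffle(shuffled)

# 51 programs for training, the remaining 52 for testing, as in the paper.
train, test = shuffled[:51], shuffled[51:]
```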
Hardware Specification | Yes | All experiments are conducted on a server equipped with an Intel Core i9-12900KF CPU (12th Generation, 24 cores) and an RTX 3090 GPU.
Software Dependencies | Yes | We implement SIRISMT in PyTorch [Paszke et al., 2017], represent SMT formulas in the SMT-LIB2 format [Barrett et al., 2016f], and use pySMT [pys, 2014] to parse formulas into ASTs. ... We use Z3 version 4.12.2 for our experiments. For symbolic execution, we use KLEE [Cadar et al., 2008] (version 3.1) on LLVM 13.
Experiment Setup | Yes | In our RL framework, we run five iterations with different random seeds for each training formula to obtain sufficient candidate tactic sequences. During the refinement stage, tactic sequence reduction and augmentation are repeated twice for efficiency, and the integration stage uses a maximum tree depth of 20 in the decision tree algorithm. For both training and testing phases, we adopt the 10-second timeout setting following FastSMT, with all experiments run using Z3.