Towards Practical Classical Planning Compilations of Numeric Planning
Authors: Luigi Bonassi, Francesco Percassi, Enrico Scala
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental Analysis The experimental analysis assesses the capability of classical planners in solving numeric planning tasks and compares their results with those of state-of-the-art numeric planners. We aim to investigate the extent to which classical planners can perform when using the different compilations and determine which encoding performs best in practice. Therefore, we have implemented the compilations in Python using the unified planning library (Micheli et al. 2025). As benchmarks, we considered all the SNP domains from the latest International Planning Competition (Taitler et al. 2024), except SETTLERS. For SETTLERS, we used the SNP formulation from Scala et al. (2020), where all numeric variables are initialised in the initial state. In the original formulation, some variables were undefined in the initial state and initialised by actions, which makes the problem not SNP. Currently, we do not support tasks that are not SNP, i.e., tasks that involve undefined numeric variables. Similarly to what was done by Cardellini, Giunchiglia, and Maratea (2024), and to understand the behaviour of the compilations based on the structure of numeric problems, we considered a measure of the numericity of each RT0 task Π, calculated as N(Π) = |N| / (|N| + |F|). When N(Π) > 0.5, Π is strongly numeric (SN), as there are more numeric variables than Boolean ones; otherwise, it is mildly numeric (MN). As classical planners (the target of our compilation), we considered several options from previous competitions, focusing on LAMA (Richter and Westphal 2010) (specifically, stopping at the first solution, denoted as LAMA-FIRST), as it is a planner that supports axioms and also the best-performing system over our instances. On the numeric side, we tested a wide range of different planners: PATTY (Cardellini, Giunchiglia, and Maratea 2024), ENHSP (Scala et al.
2020), and NLM-CUTPLAN (NLM) (Kuroiwa, Shleyfman, and Beck 2023b), selecting a high-performing configuration for each. For PATTY, we used the default configuration. For ENHSP, we adopted the M(3h||3n) configuration (https://github.com/hstairs/jpddlplus/tree/socs24-width-and-mq) from the work by Chen and Thiébaux (2024). For NLM, we used the SAT2 configuration (Kuroiwa, Shleyfman, and Beck 2022; Shleyfman, Kuroiwa, and Beck 2023), which is a lazy greedy best-first search paired with an admissible heuristic. In our experiments, each test run with a classical planner involves a compilation method and a planning task from our benchmark suite. Specifically, each run first normalises the input SNP task into an RT0, then compiles it, and finally provides the output to the selected off-the-shelf planner. Since all compilations require the size of the finite integer domain (κ bits for the BLAST-based compilations and the range D for ONE-HOT), we had to choose the appropriate size for each domain. For each domain, we identified the largest constant K from the actions, initial state, and goal description, setting κ = ⌊log2(K)⌋ + 1 for all compiled planning tasks. To ensure a fair comparison, ONE-HOT used D = Zκ, while numeric planners were evaluated on the original numeric instances. We allocate a budget of 1,800 seconds of runtime and 8 GB of memory (for normalisation, compilation, and solving). The experiments were conducted on an Intel Xeon Gold 6140M CPU with a clock speed of 2.30 GHz. Table 1 shows the coverage (number of solved instances) achieved by LAMA-FIRST on instances compiled by ONE-HOT (O), BLAST (B) and BLASTX (BX), compared to PATTY (P), ENHSP with the configuration M(3h||3n) (M) and NLM with the configuration SAT2 (S). The table is divided into two subtables, each corresponding to a class of domains: SN and MN. We list the number of bits used to represent the numeric variables (κ) and the average numericity for each domain (N). 
As the table shows, BLASTX outperforms the other compilations, achieving the highest coverage across all domains. This finding suggests that using axioms is highly advantageous, even though the planner must evaluate numerous axioms to update the bits for each numeric variable at every search step. Interestingly, ONE-HOT and BLAST achieve comparable coverage. Regarding the comparison between compilation-based and native systems, it is noteworthy that while BLASTX performs poorly in strongly numeric domains (the top part of the table), its performance in MN is comparable to, and at times even exceeds, that of the numeric planners we tested. Indeed, BLASTX is the runner-up in the MN tasks. In these domains, the gap between BLASTX and M(3h||3n) is much smaller compared to the SN domains. Specifically, and somewhat unsurprisingly, BLASTX excels in the DELIVERY domain, which has only a few numeric variables. Unexpectedly, it also excels in SETTLERS, solving three-fourths of the instances. Although SETTLERS is a challenging numeric planning problem, the Boolean structure provides sufficient information for LAMA-FIRST to better guide the search. Figure 2 shows the coverage over time for all systems, separately for SN and MN domains. Focusing on MN domains, naturally, compilation-based approaches are slower than native numeric planners due to the overhead of the compilation time. Indeed, BLASTX becomes the runner-up performer after roughly 200 seconds. We also collected data on the number of expanded nodes by LAMA-FIRST combined with BLASTX and numeric planners based on forward search, i.e., M(3h||3n) and NLM SAT2 (Figure 1). This is a measure of the informedness of the heuristics. For SN instances, both numeric planners expand fewer nodes, while for the MN instances, LAMA-FIRST expands fewer or comparable nodes. The separation between SN and MN instances is even more pronounced in the comparison with NLM SAT2. 
The results suggest that a simple criterion based on the numericity of the planning task to be solved can guide the choice between using numeric or classical planners. We also tested our compilations with SYMK (Speck et al. 2019), a state-of-the-art optimal planner supporting axioms. Overall, SYMK solved 23 problems with BLASTX, 15 with BLAST, and 0 with ONE-HOT. For comparison, the optimal configuration ORBIT for NLM solves a total of 86 problems. It is worth noting that with compilation approaches, the optimal plans may vary depending on the choice of κ. Therefore, these results should be seen as indicative rather than a fair comparison for optimal planning. |
| Researcher Affiliation | Academia | Luigi Bonassi1, Francesco Percassi2*, Enrico Scala1 1Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Brescia, Italy 2School of Computing and Engineering, University of Huddersfield, United Kingdom EMAIL,EMAIL,EMAIL |
| Pseudocode | No | The paper defines concepts like Full Adder in Definition 1 and describes compilation steps using mathematical formulas and prose, but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code https://github.com/LBonassi95/Bit Blast |
| Open Datasets | Yes | As benchmarks, we considered all the SNP domains from the latest International Planning Competition (Taitler et al. 2024), except SETTLERS. For SETTLERS, we used the SNP formulation from Scala et al. (2020) |
| Dataset Splits | No | The paper refers to "SNP domains from the latest International Planning Competition (Taitler et al. 2024)" as benchmarks, which are collections of problem instances, but it does not specify training/test/validation dataset splits typically used in machine learning for model evaluation. |
| Hardware Specification | Yes | The experiments were conducted on an Intel Xeon Gold 6140M CPU with a clock speed of 2.30 GHz. |
| Software Dependencies | No | The paper states "we have implemented the compilations in Python using the unified planning library (Micheli et al. 2025)", but it does not provide specific version numbers for Python or the unified planning library. |
| Experiment Setup | Yes | We allocate a budget of 1,800 seconds of runtime and 8 GB of memory (for normalisation, compilation, and solving). The experiments were conducted on an Intel Xeon Gold 6140M CPU with a clock speed of 2.30 GHz... For PATTY, we used the default configuration. For ENHSP, we adopted the M(3h||3n) configuration (https://github.com/hstairs/jpddlplus/tree/socs24-width-and-mq) from the work by Chen and Thiébaux (2024). For NLM, we used the SAT2 configuration (Kuroiwa, Shleyfman, and Beck 2022; Shleyfman, Kuroiwa, and Beck 2023), which is a lazy greedy best-first search paired with an admissible heuristic. For each domain, we identified the largest constant K from the actions, initial state, and goal description, setting κ = ⌊log2(K)⌋ + 1 for all compiled planning tasks. |
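The numericity criterion quoted above (N(Π) = |N| / (|N| + |F|), with Π strongly numeric when N(Π) > 0.5) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; `num_vars` and `bool_vars` are hypothetical names for the counts |N| of numeric and |F| of Boolean state variables.

```python
def numericity(num_vars: int, bool_vars: int) -> float:
    """N(Pi) = |N| / (|N| + |F|): the fraction of state variables that are numeric."""
    return num_vars / (num_vars + bool_vars)


def classify(num_vars: int, bool_vars: int) -> str:
    """Strongly numeric (SN) if numeric variables strictly outnumber Boolean ones,
    i.e. N(Pi) > 0.5; mildly numeric (MN) otherwise."""
    return "SN" if numericity(num_vars, bool_vars) > 0.5 else "MN"
```

Note that a task with equally many numeric and Boolean variables (N(Π) = 0.5) falls on the MN side, since the quoted criterion uses a strict inequality.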
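The bit-width rule for the BLAST-based compilations, read here as the usual bit count for the largest constant K (i.e. κ = ⌊log2(K)⌋ + 1; the flooring is an interpretation of the extracted formula, not stated verbatim in the quote), can be sketched as:

```python
import math


def bits_for_largest_constant(k_max: int) -> int:
    """Number of bits needed to represent the largest constant K:
    kappa = floor(log2(K)) + 1 (equivalent to K.bit_length() for K >= 1)."""
    if k_max < 1:
        raise ValueError("expected a positive largest constant")
    return math.floor(math.log2(k_max)) + 1
```

For example, K = 7 needs κ = 3 bits, while K = 8 needs κ = 4; in Python this coincides with `k_max.bit_length()`.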