Optimal Decision Tree Pruning Revisited: Algorithms and Complexity
Authors: Juha Harviainen, Frank Sommer, Manuel Sorge, Stefan Szeider
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive classical and parameterized complexity analysis of decision tree pruning operations, extending recent research on the complexity of learning small decision trees. Thereby, we offer new insights into the computational challenges of decision tree simplification, a crucial aspect of developing interpretable and efficient machine learning models. We focus on fundamental pruning operations of subtree replacement and raising, which are used in heuristics. Surprisingly, while optimal pruning can be performed in polynomial time for subtree replacement, the problem is NP-complete for subtree raising. Therefore, we identify parameters and combinations thereof that lead to fixed-parameter tractability or hardness, establishing a precise borderline between these complexity classes. For example, while subtree raising is hard for small domain size D or number d of features, it can be solved in D^{2d} · |I|^{O(1)} time, where |I| is the input size. We complement our theoretical findings with preliminary experimental results, demonstrating the practical implications of our analysis. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Helsinki, Helsinki, Finland 2Institute of Logic and Computation, TU Wien, Austria. Correspondence to: Juha Harviainen <EMAIL>, Frank Sommer <EMAIL>, Manuel Sorge <EMAIL>, Stefan Szeider <EMAIL>. |
| Pseudocode | No | The paper describes algorithms using prose and references theorems and proofs, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, code-like formatting. |
| Open Source Code | Yes | A continuously updated version of our paper is available on arXiv (Harviainen et al., 2025a) and the related source code for replicating the experiments on Zenodo (Harviainen et al., 2025b). |
| Open Datasets | Yes | We used 40 datasets from the Penn Machine Learning Benchmarks library (Romano et al., 2022), including 32 previously used for minimum-size tree computation (Bessiere et al., 2009; Narodytska et al., 2018; Staus et al., 2025) and additional larger datasets. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits or cross-validation details for the experiments conducted by the authors. It mentions using pre-existing datasets and WEKA for generating trees, but doesn't specify how the data was partitioned for their analysis. |
| Hardware Specification | Yes | We ran the implementation under Ubuntu Linux 18.04 on a compute cluster with Intel Xeon E5-2640 processors, setting a maximum RAM limit of 64GB. |
| Software Dependencies | Yes | We computed unpruned and pruned trees using WEKA 3.8.5's (Frank et al., 2010) C4.5 implementation (Quinlan, 1993)... We implemented a dynamic-programming algorithm for solving DTRAIS= based on Theorem 4.2 in Python, tested with versions 3.6.9 and 3.10.12. |
| Experiment Setup | Yes | The unpruned trees were obtained by running WEKA's J48 classifier with the flags -no-cv -B -M 0 -J -U. We obtained two types of pruned trees: those obtained by the replacement heuristic implemented in J48 when run with the flags -no-cv -B -M 0 -J -S, and those obtained by the raising heuristic, corresponding to the flags -no-cv -B -M 0 -J. |
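The abstract row above contrasts the two pruning operations the paper studies: subtree replacement (optimally solvable in polynomial time) and subtree raising (NP-complete). The following minimal Python sketch illustrates bottom-up subtree replacement on training data; the tree encoding, function names, and majority-leaf tie-breaking are illustrative assumptions, not the authors' implementation or WEKA's.

```python
# Hedged sketch of bottom-up subtree replacement: replace an internal node
# by a majority-class leaf whenever that does not increase training error.
# Tree encoding (assumed for illustration):
#   ("leaf", label)  or  ("split", feature, {value: subtree}).
# Examples are (feature_dict, label) pairs, and every feature value seen in
# the data is assumed to have a matching branch in the split's children.
from collections import Counter

def classify(tree, feats):
    """Route a single example down the tree and return the leaf label."""
    while tree[0] == "split":
        tree = tree[2][feats[tree[1]]]
    return tree[1]

def errors(tree, examples):
    """Number of training examples the tree misclassifies."""
    return sum(classify(tree, f) != y for f, y in examples)

def prune_by_replacement(tree, examples):
    """Prune bottom-up: children first, then try to collapse this node."""
    if tree[0] == "leaf":
        return tree
    _, feature, children = tree
    # Recurse on each child with the examples that reach it.
    pruned_children = {}
    for value, child in children.items():
        subset = [(f, y) for f, y in examples if f[feature] == value]
        pruned_children[value] = prune_by_replacement(child, subset)
    pruned = ("split", feature, pruned_children)
    # Candidate leaf: majority class among the examples reaching this node.
    labels = Counter(y for _, y in examples)
    if not labels:
        return pruned
    majority, _ = labels.most_common(1)[0]
    if errors(("leaf", majority), examples) <= errors(pruned, examples):
        return ("leaf", majority)
    return pruned
```

A split whose branches all predict the same class as the local majority collapses to a single leaf, while a split that separates classes on the training data is kept; subtree raising, by contrast, may move a child subtree up to replace its parent, which is where the paper's hardness results apply.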