Optimal Decision Trees for Interpretable and Constrained Clustering
Authors: Pouya Shati, Yuliang Song, Eldan Cohen, Sheila A. McIlraith
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments involving a range of real-world and synthetic datasets demonstrate that our approach can produce interpretable clustering solutions that are of superior quality compared to their non-interpretable counterparts, with or without the addition of constraints. We further provide new insights into the trade-off between interpretability and the satisfaction of user-specified constraints, presenting extensions to our clustering approach that treat the satisfaction of constraints as an additional optimization objective. |
| Researcher Affiliation | Academia | POUYA SHATI, Department of Computer Science, University of Toronto, Canada and Vector Institute for Artificial Intelligence, Canada; YULIANG SONG, Department of Mechanical and Industrial Engineering, University of Toronto, Canada; ELDAN COHEN, Department of Mechanical and Industrial Engineering, University of Toronto, Canada; SHEILA A. MCILRAITH, Department of Computer Science, University of Toronto, Canada and Vector Institute for Artificial Intelligence, Canada |
| Pseudocode | Yes | Algorithm 1 Smart Pairs |
| Open Source Code | No | The paper states: "Our approach is implemented in Python and Java and we use the Loandra solver (Berg, Demirović, et al. 2019) to solve the encodings that we generate." This describes the implementation details and tools used, but does not include an explicit statement or link confirming that the authors' own source code for the methodology is openly available or will be released. |
| Open Datasets | Yes | Our benchmarks include seven real datasets from the UCI repository (Dua and Graff 2017) and four synthetic datasets from FCPS (Ultsch and Lötsch 2020). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It mentions using 'ground-truth labels' for evaluation and that 'each experiment is run with 20 different seeds', but no specific methodology for partitioning the datasets into distinct training, testing, or validation sets is described in the context of the clustering problem. |
| Hardware Specification | Yes | We run experiments on a server with two 12-core Intel E5-2697v2 CPUs and 128G of RAM. |
| Software Dependencies | Yes | Our approach is implemented in Python and Java and we use the Loandra solver (Berg, Demirović, et al. 2019) to solve the encodings that we generate. ... We have implemented the model using the Gurobi v10 solver. |
| Experiment Setup | Yes | We set a time limit of 30 minutes every time a call to the solver is made; a feasible solution is salvaged in case of a timeout if possible. ... For each dataset, we fix the depth value to the lowest number that provides more than twice as many leaves as there are clusters. ... Specifically, for a given 𝜅 value with 0 < 𝜅 ≤ (|𝑋| − 1)/2, we generate a set of 𝜅 · |𝑋| random pairwise constraints following Wagstaff and Cardie (2000): (1) we generate 𝜅 · |𝑋| pairs of data points without repetition; (2) we add each pair to the must-link (𝑀𝐿) constraint set if both data points share the same ground-truth label, and to the cannot-link (𝐶𝐿) set otherwise. To mitigate the potential bias due to the random generation of constraints, each experiment is run with 20 different seeds, with the overall mean being reported. ... we solve Problem 1 with 𝜖 = 0.1 for ML and CL sets generated with various 𝜅 values. |
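The constraint-generation recipe quoted in the Experiment Setup row (following Wagstaff and Cardie 2000) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name, parameters, and use of Python's `random` module are assumptions.

```python
import random

def generate_pairwise_constraints(labels, kappa, seed=0):
    """Sketch of the described procedure: sample kappa * |X| distinct
    pairs of data points, then split them into must-link (ML) and
    cannot-link (CL) sets by ground-truth label agreement."""
    rng = random.Random(seed)  # one seed per experimental run
    n = len(labels)
    n_pairs = int(kappa * n)
    # Enumerate all unordered pairs and sample without repetition.
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    chosen = rng.sample(all_pairs, n_pairs)
    ml = {(i, j) for i, j in chosen if labels[i] == labels[j]}
    cl = {(i, j) for i, j in chosen if labels[i] != labels[j]}
    return ml, cl
```

Repeating this with 20 different seeds and averaging the results would mirror the paper's bias-mitigation protocol.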