Generating Likely Counterfactuals Using Sum-Product Networks

Authors: Jiří Němeček, Tomáš Pevný, Jakub Mareček

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We first train a basic feed-forward Neural Network (NN) classifier with 2 hidden layers with ReLU activations. One could easily use one of the variety of ML models that can be formulated using MIO, including linear models, (gradient-boosted) trees, forests, or graph neural networks. Second, we train an SPN to model the likelihood on the same training dataset. We include the class y of a sample x in the training since we have prior knowledge of the counterfactual class. SPNs have a variety of training methods (Xia et al., 2023), of which we use a variant of LearnSPN (Gens & Domingos, 2013) implemented in the SPFlow library (Molina et al., 2019), though newer methods exist (e.g., Trapp et al., 2019). Data: We tested on the Give Me Some Credit (GMSC) dataset (Fusion & Cukierski, 2011), the Adult dataset (Becker & Kohavi, 1996), and the German Credit (referred to as Credit) dataset (Hofmann, 1994). We dropped some outlier data and some less informative features (details in Section D) and performed all experiments in a 5-fold cross-validation setting.
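The classifier described above (a feed-forward network with two ReLU hidden layers and one output neuron) can be sketched as a plain NumPy forward pass. The weights, layer sizes, and batch here are illustrative placeholders, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Forward pass of a feed-forward net with two ReLU hidden layers
    and a single sigmoid output neuron."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    logit = h2 @ W3 + b3
    return 1.0 / (1.0 + np.exp(-logit))  # class probability in (0, 1)

# Illustrative shapes: 10 input features, hidden sizes 20 and 10
# (matching the experiment-setup quote below), one output neuron.
params = (
    rng.normal(size=(10, 20)), np.zeros(20),
    rng.normal(size=(20, 10)), np.zeros(10),
    rng.normal(size=(10, 1)), np.zeros(1),
)
x = rng.normal(size=(4, 10))  # a small synthetic batch
probs = forward(x, params)
print(probs.shape)  # (4, 1)
```

Any such piecewise-linear network admits a mixed-integer (MIO) encoding, which is what makes the model class interchangeable in the quoted setup.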
Researcher Affiliation Academia Jiří Němeček, Tomáš Pevný & Jakub Mareček, Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University, Karlovo náměstí 13, Praha 2, 121 35 EMAIL
Pseudocode No The paper does not contain any clearly labeled pseudocode or algorithm blocks. It provides mathematical formulations (e.g., MIO formulations) but not in a structured pseudocode format.
Open Source Code Yes The source code with examples is available at https://github.com/Epanemu/LiCE. The entire implementation, together with the data, is available at https://github.com/Epanemu/LiCE.
Open Datasets Yes Data: We tested on the Give Me Some Credit (GMSC) dataset (Fusion & Cukierski, 2011), the Adult dataset (Becker & Kohavi, 1996), and the German Credit (referred to as Credit) dataset (Hofmann, 1994).
GMSC: We do not remove any feature in GMSC, but we keep only data with reasonable values to avoid numerical issues within MIO. The thresholds for keeping a sample are as follows:
- MonthlyIncome < 50000
- RevolvingUtilizationOfUnsecuredLines < 1
- NumberOfTime30-59DaysPastDueNotWorse < 10
- DebtRatio < 2
- NumberOfOpenCreditLinesAndLoans < 40
- NumberOfTimes90DaysLate < 10
- NumberRealEstateLoansOrLines < 10
- NumberOfTime60-89DaysPastDueNotWorse < 10
- NumberOfDependents < 10
This removes around 5.5% of the data after samples with missing values were removed. We could combat the same issues by taking a log of some of the features. In our pruned GMSC dataset, there are 113,595 samples and 10 features, none of which are categorical; 7 are discrete contiguous, and the remaining 3 are real continuous. Further details are in the preprocessing code.
Adult: In the Adult dataset, we remove 5 features:
- fnlwgt, which equals the estimated number of people the data sample represents in the census, and is thus not actionable and difficult to obtain for new data, making it less useful for predictions;
- education-num, because it can be substituted by the ordinal feature education;
- native-country, because it is again not actionable, less informative, and also heavily imbalanced;
- capital-gain and capital-loss, because they contain few non-zero values.
It is not uncommon to remove the features we did, as some of them also have many missing values. We remove only about 2% of the data by removing samples with missing values. We are left with 47,876 samples and 9 features, 5 of which are categorical, 1 is binary, 1 ordinal, and the remaining 2 are discrete contiguous. Further details are in the preprocessing code.
Credit We do not remove any samples or features for the Credit dataset. The dataset contains 1,000 samples and 20 features, 10 of which are categorical, 2 are binary, 1 ordinal, 5 are discrete contiguous, and the remaining 2 are real continuous. Further details are in the preprocessing code.
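The GMSC pruning quoted above is a plain upper-bound filter per column. A hedged sketch with pandas, using invented rows and only a subset of the quoted thresholds (column names follow the GMSC schema; the data is synthetic):

```python
import pandas as pd

# Synthetic stand-in rows -- not the real GMSC data.
df = pd.DataFrame({
    "MonthlyIncome": [3000, 80000, 4500],
    "RevolvingUtilizationOfUnsecuredLines": [0.3, 0.5, 1.4],
    "DebtRatio": [0.2, 0.4, 0.1],
})

# Strict upper bounds from the quoted preprocessing (subset shown).
bounds = {
    "MonthlyIncome": 50_000,
    "RevolvingUtilizationOfUnsecuredLines": 1,
    "DebtRatio": 2,
}

# Keep only rows satisfying every bound.
mask = pd.Series(True, index=df.index)
for col, ub in bounds.items():
    mask &= df[col] < ub

pruned = df[mask]
print(len(pruned))  # 1: row 1 fails the income bound, row 2 the utilization bound
```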
Dataset Splits Yes We dropped some outlier data and some less informative features (details in Section D) and performed all experiments in a 5-fold cross-validation setting.
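A 5-fold cross-validation setting like the one quoted can be reproduced with scikit-learn's KFold; the feature matrix here is a synthetic placeholder:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # placeholder feature matrix, 50 samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # Each of the 5 folds: 40 training samples, 10 held-out samples.
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)  # [(40, 10), (40, 10), (40, 10), (40, 10), (40, 10)]
```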
Hardware Specification Yes Most experiments ran on a personal laptop with 32GB of RAM and 16 CPUs (AMD Ryzen 7 PRO 6850U), but since the proposed methods had undergone wider experimentation, their experiments were run on an internal cluster with an assigned 32GB of RAM and 16 CPUs, some AMD EPYC 7543 and some Intel Xeon Scalable Gold 6146, based on their availability.
Software Dependencies No MIO and LiCE are implemented using the open-source Pyomo modeling library (Bynum et al., 2021), which allows for the simple use of (almost) any MIO solver. We use the Gurobi solver (Gurobi Optimization, LLC, 2024). We encode the classification model using the OMLT library (Ceccon et al., 2022), which simplifies the formulation of various ML models, although we focus on Neural Networks. SPNs have a variety of training methods (Xia et al., 2023), of which we use a variant of LearnSPN (Gens & Domingos, 2013) implemented in the SPFlow library (Molina et al., 2019), though newer methods exist (e.g., Trapp et al., 2019). We utilize the bnlearn Python library (Taskesen, 2020) and select 7 Bayesian Networks of varying sizes from the Bayesian Network Repository (Scutari, 2010) (namely asia,
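The quoted stack trains SPNs with SPFlow's LearnSPN. Independent of SPFlow's API, what an SPN computes can be illustrated with a tiny hand-built network evaluated in log-space: product nodes add leaf log-densities, sum nodes take a weighted log-sum-exp. Structure, weights, and Gaussian parameters below are invented for the example:

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Log-density of a univariate Gaussian leaf."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def spn_log_likelihood(x1, x2):
    """Evaluate a minimal hand-built SPN over (x1, x2): a sum node with
    weights 0.4/0.6 over two product nodes, each multiplying two
    Gaussian leaves. In log-space, products become sums and the sum
    node becomes a log-sum-exp of its weighted children."""
    comp1 = gauss_logpdf(x1, 0.0, 1.0) + gauss_logpdf(x2, 0.0, 1.0)
    comp2 = gauss_logpdf(x1, 3.0, 1.0) + gauss_logpdf(x2, 3.0, 1.0)
    return np.logaddexp(np.log(0.4) + comp1, np.log(0.6) + comp2)

ll = spn_log_likelihood(0.1, -0.2)
```

This two-level structure is exactly what makes SPN log-likelihoods amenable to an MIO encoding: each node is a sum, a product (sum in log-space), or a leaf.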
Experiment Setup Yes Neural Network: We compare methods on a neural network with four layers: the first with a size equal to the length of the encoded input, then 20 and 10 for the hidden layers, and a single neuron as output. It was trained with batch size 64 for 50 epochs. MIO and LiCE: For our methods, we configure a time limit of 2 minutes for MIO solving. This is high enough for MIO, but constrained LiCE struggles with increasing likelihood requirements. We generate the 10 closest CEs, not using the relative distance parameter. We set the decision margin τ = 10^-4 and use a single ϵ_j = 10^-4 for all features j because they are normalized. In the SPNs, we use T^LL_n = 100 as a safe upper bound, though this could be computed more tightly for an individual sum node. We choose δ_SPN equal to the median (or lower quartile) of the likelihood on the dataset. For LiCE (optimize), we used α = 0.1, since features are normalized to [0, 1] and the log-likelihood often takes values in the [−100, −10] range.
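The δ_SPN choice described above (median or lower quartile of the training-set likelihood) reduces to a quantile computation over per-sample log-likelihoods. A sketch with synthetic values, invented to roughly match the [−100, −10] range mentioned in the setup:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for per-sample SPN log-likelihoods on the
# training data -- not values from the actual paper.
log_liks = rng.uniform(-100.0, -10.0, size=1000)

delta_median = np.median(log_liks)            # delta_SPN at the median
delta_quartile = np.percentile(log_liks, 25)  # or at the lower quartile
```

The lower-quartile threshold is the more permissive of the two, admitting counterfactuals of lower modeled likelihood than the median would.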