Interpretable and Fair Boolean Rule Sets via Column Generation
Authors: Connor Lawless, Sanjeeb Dash, Oktay Gunluk, Dennis Wei
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the performance of our approach, we present a suite of numerical results. We start by describing the data sets, experimental setup, and computational environment that we used. In Section 5.2, we present an empirical study aimed at understanding the effect of using Hamming loss as a proxy for 0-1 loss. We benchmark our approach against state-of-the-art algorithms in both the traditional classification setting (Section 5.3) and the fairness setting (Section 5.4). |
| Researcher Affiliation | Collaboration | Connor Lawless¹, Sanjeeb Dash², Oktay Günlük¹, Dennis Wei² — ¹ Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850; ² IBM Research, Yorktown Heights, NY 10598 |
| Pseudocode | No | The paper describes the column generation procedure and Pricing Problem in prose and mathematical formulations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' section with structured, step-by-step instructions. Figure 1 is a flowchart, not pseudocode. |
| Open Source Code | No | The paper mentions publicly available code for benchmark methods like 'Fair CORELS (Aïvodji et al., 2021)' and 'BRS' but does not explicitly state that the authors are releasing the source code for their own methodology described in this paper. |
| Open Datasets | Yes | Evaluations were conducted on 15 classification data sets from the UCI repository (Dua and Karra Taniskidou, 2017) that have been used in recent works on rule set/Boolean classifiers (Malioutov and Varshney, 2013; Dash et al., 2014; Su et al., 2016; Wang et al., 2017). In addition, we used data from the FICO Explainable Machine Learning Challenge (FICO). For all three data sets we include the sensitive attribute as a feature for prediction as their exclusion is both not enough to ensure fairness and may lead to worse fairness outcomes (see Corbett-Davies and Goel, 2018, for a discussion). For the compas data from Pro Publica (Angwin et al., 2016) we use the fair machine learning cleaned data set from Kaggle. |
| Dataset Splits | Yes | Test performance on all data sets is estimated using 10-fold stratified cross-validation (CV). Unless otherwise stated, all hyperparameters were tuned using nested 10-fold cross-validation. All tables in this section are grouped by problem size in increasing order (i.e., first section corresponds to small data sets, the second to medium and large data sets). |
| Hardware Specification | Yes | All results in this section were obtained on a personal computer with a 3 GHz processor and 16 GB of RAM. All the linear and integer programs were solved using Gurobi 9.0 (Gurobi Optimization, 2020). All results in this section were obtained using a single 2.0 GHz core of a server with 64 GB of memory (only a small fraction of which was used). |
| Software Dependencies | Yes | All the linear and integer programs were solved using Gurobi 9.0 (Gurobi Optimization, 2020). All linear and integer programs were solved using CPLEX 12.7.1 (Cplex, 2009). Additional comparisons include the WEKA (Frank et al., 2016) JRip implementation of RIPPER (Cohen, 1995), a rule set learner that is still state-of-the-art in accuracy, and the scikit-learn (Pedregosa et al., 2011) implementations of the decision tree learner CART (Breiman et al., 1984) and Random Forests (RF) (Breiman, 2001). |
| Experiment Setup | Yes | For CG, we used an overall time limit of 300 seconds for training and a time limit of 45 seconds for solving the Pricing Problem in each iteration. In these experiments, CG was given an overall time limit of 120 seconds for each candidate value of C and the time limit for the Pricing Problem was set to 30 seconds. To offset the decrease in the time limit, we performed a second pass for each data set, solving the restricted MIP with all the rules generated for all possible choices of C. Mean test accuracy (over 10 partitions) and rule set complexity are reported in Tables 2 and 3. For BRS, we fixed κ = 1 as optimizing κ did not improve accuracy on the whole (as can be expected from Figure 4). Tables 2 and 3 also include results from RIPPER, CART, and RF. We tuned the minimum number of samples per leaf for CART and RF, used 100 trees for RF, and otherwise kept the default settings. |
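The Section 5.2 study quoted above rests on a simple relationship between the two losses: for a DNF rule set, 0-1 loss charges each misclassified sample once, while Hamming loss charges an uncovered positive sample once and a covered negative sample once per clause that covers it, so Hamming loss upper-bounds 0-1 loss. A minimal sketch of that relationship on toy data (the clause-coverage matrix here is illustrative, not from the paper):

```python
import numpy as np

def zero_one_loss(y, clause_hits):
    """Fraction misclassified; the rule set predicts positive iff any clause fires."""
    y_pred = (clause_hits.sum(axis=1) > 0).astype(int)
    return np.mean(y_pred != y)

def hamming_loss(y, clause_hits):
    """Uncovered positives cost 1; negatives cost 1 per clause that covers them."""
    uncovered_pos = ((y == 1) & (clause_hits.sum(axis=1) == 0)).sum()
    neg_cost = clause_hits[y == 0].sum()
    return (uncovered_pos + neg_cost) / len(y)

# toy example: 4 samples, 2 clauses (rows = samples, 1 = clause covers sample)
y = np.array([1, 1, 0, 0])
hits = np.array([[1, 0],   # positive, covered        -> no loss either way
                 [0, 0],   # positive, uncovered      -> cost 1 in both losses
                 [1, 1],   # negative, covered twice  -> 0-1 cost 1, Hamming cost 2
                 [0, 0]])  # negative, uncovered      -> no loss either way
print(zero_one_loss(y, hits))  # 0.5
print(hamming_loss(y, hits))   # 0.75
```

The gap between the two values comes entirely from negatives covered by more than one clause, which is why Hamming loss works well as a proxy when clauses overlap little.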
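The evaluation protocol quoted in the Dataset Splits row (test accuracy via 10-fold stratified CV, hyperparameters via nested 10-fold CV) can be sketched with scikit-learn. The dataset, learner, and hyperparameter grid below are stand-ins for illustration, not the paper's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the UCI data sets

# Inner loop tunes a hyperparameter; outer loop estimates test performance.
inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"min_samples_leaf": [1, 5, 10, 25]}, cv=inner)
scores = cross_val_score(grid, X, y, cv=outer)  # one accuracy per outer fold
print(scores.mean())
```

Nesting matters because tuning and evaluating on the same folds would leak the test folds into model selection and bias the reported accuracy upward.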
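The baseline configuration quoted in the Experiment Setup row (tune the minimum number of samples per leaf for CART and RF, fix RF at 100 trees, keep other defaults) maps directly onto scikit-learn. The dataset and candidate grid here are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
leaf_grid = {"min_samples_leaf": [1, 5, 10]}  # illustrative candidate values

# CART: tune min_samples_leaf, otherwise defaults
cart = GridSearchCV(DecisionTreeClassifier(random_state=0), leaf_grid, cv=cv)
# RF: 100 trees fixed, tune min_samples_leaf, otherwise defaults
rf = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                  leaf_grid, cv=cv)
cart.fit(X, y)
rf.fit(X, y)
print(cart.best_params_, rf.best_params_)
```

Note this snippet only shows the tuning step; the paper's reported numbers come from wrapping such a search inside the outer cross-validation loop.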