HiClass: a Python Library for Local Hierarchical Classification Compatible with Scikit-learn
Authors: Fábio M. Miranda, Niklas Köhnecke, Bernhard Y. Renard
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figure 2, we compare the hierarchical F-score, computational resources (measured with the command `time`) and disk usage. This comparison was performed between two flat classifiers from the library scikit-learn and Microsoft's LightGBM (Ke et al., 2017) versus the local hierarchical classifiers implemented in HiClass. In order to avoid bias, cross-validation and hyperparameter tuning were performed on both the local hierarchical classifiers and the flat classifiers. For comparison purposes, we used a snapshot from 02/11/2022 of the consumer complaints data set provided by the Consumer Financial Protection Bureau of the United States (Bureau and General, 2022). |
| Researcher Affiliation | Academia | Fábio M. Miranda, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany, and Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany; Niklas Köhnecke, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany; Bernhard Y. Renard, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany |
| Pseudocode | No | The paper describes the algorithms for Local Classifier Per Node, Local Classifier Per Parent Node, and Local Classifier Per Level in Appendix C using descriptive text and figures, and defines training policies in tables, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code and documentation are available at https://github.com/scikit-learn-contrib/hiclass. |
| Open Datasets | Yes | For comparison purposes, we used a snapshot from 02/11/2022 of the consumer complaints data set provided by the Consumer Financial Protection Bureau of the United States (Bureau and General, 2022), which after preprocessing contained 727,495 instances for cross-validation and hyperparameter tuning as well as training and 311,784 more for validation. |
| Dataset Splits | Yes | First the data set was split with 70% of the data being used for hyperparameter tuning and training, while 30% was held for a final evaluation. The subset with 70% of data held for training was further split into 5 subsets for 5-fold cross-validation and identification of best hyperparameter combination. |
| Hardware Specification | Yes | The benchmark was computed on multiple cluster nodes running GNU/Linux with 512 GB physical memory and 128 cores provided by two AMD EPYC 7742 processors. |
| Software Dependencies | No | The paper mentions 'Packages for Python 3.7-3.9' and various libraries such as 'scikit-learn', 'NumPy', 'NetworkX', 'Ray', 'Joblib', 'Hydra', 'Optuna', and 'LightGBM'. While a Python version range is given, specific version numbers for the other key software components used in the methodology are not explicitly provided in the text. |
| Experiment Setup | Yes | For hyperparameter tuning, the models were trained using 4 folds as training data and validated on the remaining one. This process was repeated 5 times, with a different fold combination used in each iteration, and the average hierarchical F-score was reported as the performance metric. The selection of the best hyperparameters was assisted by Hydra (Meta, 2022) and its plugin Optuna (Akiba et al., 2019), through a grid search using the combinations of hyperparameters described in Tables 2-4. |
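The split and tuning protocol described in the table rows above (70% for tuning and training, 30% held out, 5-fold cross-validated grid search) can be sketched with plain scikit-learn. This is a minimal sketch, not the paper's pipeline: synthetic data and a flat `LogisticRegression` with a hypothetical parameter grid stand in for the consumer-complaints corpus, the HiClass local hierarchical classifiers, the hierarchical F-score, and the Hydra/Optuna tooling the authors actually used. Because HiClass estimators follow the scikit-learn API, they could be dropped into the same scaffold.

```python
# Sketch of the evaluation protocol from the paper's setup (assumptions:
# synthetic data, a flat classifier, and an illustrative parameter grid).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 70% of the data for hyperparameter tuning and training,
# 30% held out for a final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# 5-fold cross-validated grid search: each candidate is trained on 4 folds
# and validated on the remaining one, repeated 5 times, and the average
# score selects the best hyperparameter combination.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # hypothetical grid
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.score(X_test, y_test), 3))
```

In the paper, the scoring function would be the hierarchical F-score rather than accuracy, and the grid would span the combinations listed in its Tables 2-4.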