OpenEnsembles: A Python Resource for Ensemble Clustering
Authors: Tom Ronan, Shawn Anastasio, Zhijie Qi, Pedro Henrique S. Vieira Tavares, Roman Sloutsky, Kristen M. Naegle
JMLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have documented examples of using OpenEnsembles to create, analyze, and visualize a number of different types of ensemble approaches on toy and example datasets. Figure 1: Code and outputs of OpenEnsembles. An ensemble approach (Fred, 2001) to find stable and optimal solutions from the combination of many solutions of the non-deterministic k-means algorithm with majority vote. As one can see from the determinant ratio index plot in Figure 1, as the number of clustering solutions in the ensemble increases, the solution both (a) stabilizes and (b) correctly identifies the number of inherent, connected clusters within the data. |
| Researcher Affiliation | Academia | Department of Biomedical Engineering and the Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63122, USA. Department of Computer Science, University of Arizona, Tucson, AZ 85721, USA. |
| Pseudocode | Yes | 1 import pandas as pd 2 from sklearn import datasets 3 import openensembles as oe 4 #Set up a dataset and put in pandas DataFrame. 5 x, y = datasets.make_moons(n_samples=200, shuffle=True, noise=0.02, random_state=None) 6 df = pd.DataFrame(x) 7 #instantiate the oe data object 8 dataObj = oe.data(df, [1,2]) 9 #instantiate an oe clustering object 10 c = oe.cluster(dataObj) 11 c_MV_arr = [] 12 val_arr = [] 13 for i in range(0,39): 14 # add a new clustering solution, with a unique name 15 name = 'kmeans_' + str(i) 16 c.cluster('parent', 'kmeans', name, K=16, init = 'random', n_init = 1) 17 # calculate a new majority vote solution, where c has one more solution on each iteration 18 c_MV_arr.append(c.finish_majority_vote(threshold=0.5)) 19 #calculate the determinant ratio metric for each majority vote solution 20 v = oe.validation(dataObj, c_MV_arr[i]) 21 val_name = v.calculate('det_ratio', 'majority_vote', 'parent') 22 val_arr.append(v.validation[val_name]) 23 #calculate the co-occurrence matrix 24 coMat = c.co_occurrence_matrix() |
| Open Source Code | Yes | OpenEnsembles is released under the GNU General Public License version 3, can be installed via Conda or the Python Package Index (pip), and is available at https://github.com/NaegleLab/OpenEnsembles. |
| Open Datasets | Yes | Documented examples of using OpenEnsembles to create, analyze, and visualize a number of different types of ensemble approaches on toy and example datasets. `x, y = datasets.make_moons(n_samples=200, shuffle=True, noise=0.02, random_state=None)` |
| Dataset Splits | No | The paper uses `datasets.make_moons(n_samples=200, shuffle=True, noise=0.02, random_state=None)` to generate data. It applies clustering to this dataset but does not specify any training/test/validation splits for experiment reproduction in the conventional sense of supervised learning. |
| Hardware Specification | No | No specific hardware details (GPU models, CPU models, or cloud resources) are mentioned in the paper. |
| Software Dependencies | No | The paper lists software dependencies: 'scikit-learn (Pedregosa et al., 2012), Pandas (McKinney, 2010), Matplotlib (Hunter, 2007), NetworkX (Hagberg et al., 2008), and NumPy (Walt et al., 2011).' However, it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | `c.cluster('parent', 'kmeans', name, K=16, init = 'random', n_init = 1)` |
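
The majority-vote consensus quoted in the Pseudocode row can be illustrated without OpenEnsembles. What follows is a minimal standard-library sketch, not the library's implementation: it builds a co-occurrence matrix from a few hand-written label vectors and merges points that co-cluster in at least a `threshold` fraction of the solutions, following such links transitively. The function names `co_occurrence` and `majority_vote`, the toy labelings, and the choice to count a tie at the threshold as a link are all assumptions made for illustration.

```python
from itertools import combinations


def co_occurrence(labelings):
    """Return the fraction of labelings in which each pair of points shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def majority_vote(co_mat, threshold=0.5):
    """Join points whose co-occurrence is >= threshold (assumed tie rule),
    then return one consensus label per point via union-find."""
    n = len(co_mat)
    parent = list(range(n))

    def find(i):
        # Walk to the root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if co_mat[i][j] >= threshold:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive integers in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three hypothetical clustering solutions over six points.
# Cluster ids are arbitrary and need not match across solutions.
labelings = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 2],
]

co_mat = co_occurrence(labelings)
consensus = majority_vote(co_mat, threshold=0.5)
print(consensus)  # [0, 0, 0, 1, 1, 1]
```

In this toy input, points 3 and 5 share a cluster in only one of the three solutions, yet they land in the same consensus cluster because each co-clusters with point 4 in two of three. This kind of chaining is plausibly what allows many fine-grained solutions (such as the K=16 k-means runs quoted above) to combine into a small number of connected clusters.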