reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

OpenEnsembles: A Python Resource for Ensemble Clustering

Authors: Tom Ronan, Shawn Anastasio, Zhijie Qi, Pedro Henrique S. Vieira Tavares, Roman Sloutsky, Kristen M. Naegle

JMLR 2018

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We have documented examples of using OpenEnsembles to create, analyze, and visualize a number of different types of ensemble approaches on toy and example datasets. Figure 1: Code and outputs of OpenEnsembles. An ensemble approach (Fred, 2001) to find stable and optimal solutions from the combination of many solutions of the non-deterministic k-means algorithm with majority vote. As one can see from the determinant ratio index plot in Figure 1, as the number of clustering solutions in the ensemble increases, the solution both (a) stabilizes and (b) correctly identifies the number of inherent, connected clusters within the data.
Researcher Affiliation	Academia	Department of Biomedical Engineering and the Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63122, USA. Department of Computer Science, University of Arizona, Tucson, AZ 85721, USA.
Pseudocode	Yes	1 import pandas as pd 2 from sklearn import datasets 3 import openensembles as oe 4 #Set up a dataset and put in pandas DataFrame. 5 x, y = datasets.make_moons(n_samples=200, shuffle=True, noise=0.02, random_state=None) 6 df = pd.DataFrame(x) 7 #instantiate the oe data object 8 dataObj = oe.data(df, [1,2]) 9 #instantiate an oe clustering object 10 c = oe.cluster(dataObj) 11 c_MV_arr = [] 12 val_arr = [] 13 for i in range(0,39): 14 # add a new clustering solution, with a unique name 15 name = 'kmeans_' + str(i) 16 c.cluster('parent', 'kmeans', name, K=16, init = 'random', n_init = 1) 17 # calculate a new majority vote solution, where c has one more solution on each iteration 18 c_MV_arr.append(c.finish_majority_vote(threshold=0.5)) 19 #calculate the determinant ratio metric for each majority vote solution 20 v = oe.validation(dataObj, c_MV_arr[i]) 21 val_name = v.calculate('det_ratio', 'majority_vote', 'parent') 22 val_arr.append(v.validation[val_name]) 23 #calculate the co-occurrence matrix 24 coMat = c.co_occurrence_matrix()
Open Source Code	Yes	OpenEnsembles is released under the GNU General Public License version 3, can be installed via Conda or the Python Package Index (pip), and is available at https://github.com/NaegleLab/OpenEnsembles.
Open Datasets	Yes	documented examples of using OpenEnsembles to create, analyze, and visualize a number of different types of ensemble approaches on toy and example datasets. x, y = datasets.make_moons(n_samples=200, shuffle=True, noise=0.02, random_state=None)
Dataset Splits	No	The paper uses `datasets.make_moons(n_samples=200, shuffle=True, noise=0.02, random_state=None)` to generate data. It applies clustering to this dataset but does not specify any training/test/validation splits for experiment reproduction in the conventional sense of supervised learning.
Hardware Specification	No	No specific hardware details (GPU models, CPU models, or cloud resources) are mentioned in the paper.
Software Dependencies	No	The paper lists software dependencies: 'scikit-learn (Pedregosa et al., 2012), Pandas (McKinney, 2010), Matplotlib (Hunter, 2007), NetworkX (Hagberg et al., 2008), and NumPy (Walt et al., 2011).' However, it does not provide specific version numbers for these software components.
Experiment Setup	Yes	c.cluster('parent', 'kmeans', name, K=16, init = 'random', n_init = 1)
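
The Pseudocode row combines many k-means solutions through a co-occurrence matrix and a majority vote with `threshold=0.5`. The following is a minimal, dependency-free sketch of that idea, not the OpenEnsembles implementation. The function names `co_occurrence_matrix` and `majority_vote` are hypothetical here, and joining points through connected components of the thresholded matrix is an assumption about how the consensus is formed, in the spirit of the voting scheme the paper cites (Fred, 2001).

```python
from itertools import combinations


def co_occurrence_matrix(labelings):
    """Fraction of solutions in which each pair of points shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def majority_vote(labelings, threshold=0.5):
    """Join every pair that co-clusters in more than `threshold` of the solutions.

    Assumption: the consensus clusters are the connected components of the
    thresholded co-occurrence graph, found here with a small union-find.
    """
    co = co_occurrence_matrix(labelings)
    n = len(co)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Three toy solutions over six points; cluster ids need not agree across runs.
solutions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
co = co_occurrence_matrix(solutions)
consensus = majority_vote(solutions, threshold=0.5)
print(consensus)  # [0, 0, 0, 1, 1, 1]
```

In the toy input the second solution splits the points differently, but points 0 to 2 and points 3 to 5 still co-cluster in at least two of the three solutions, so the vote recovers two groups. In the paper's setting the label vectors would come from repeated k-means runs with K=16, random initialization, and a single initialization per run.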