reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Covariance-based Clustering in Multivariate and Functional Data Analysis

Authors: Francesca Ieva, Anna Maria Paganoni, Nicholas Tarabelloni

JMLR 2016 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We establish the effectiveness of our algorithm through applications to both synthetic data and a real data set coming from a biomedical context, showing also how the use of shrinkage estimation may lead to substantially better results. Keywords: Clustering, covariance operator, operator distance, shrinkage estimation, functional data analysis. In this section we provide three simulations involving our proposed clustering method. In Subsection 4.1 we show a first example, regarding standard bivariate data, in order to give a clear geometric idea of clustering based on covariance structures. In Subsection 4.2 we show an application to synthetic functional data. In these former two examples the true subdivision of samples is known, so the goodness of the clustering arising from Max Swap algorithm is assessed against the true identities of data. In Subsection 4.3, instead, we apply the clustering algorithm on real functional data expressing the concentration of deoxygenated hemoglobin measured in human subjects brains.
Researcher Affiliation	Academia	Francesca Ieva EMAIL, Department of Mathematics "F. Enriques", Università degli Studi di Milano, Via Cesare Saldini 50, 20133 Milano, Italy; Anna Maria Paganoni EMAIL, Nicholas Tarabelloni EMAIL, MOX Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Via Bonardi 9, 20133 Milano, Italy
Pseudocode	Yes	The complete formulation of our Max-Swap algorithm is summarised in Algorithm 1, where we specify for the sake of clarity that the symbol $I_1^k(p)$, for instance, indicates the $p$-th element of the set of indexes $I_1^k$. Algorithm 1: Max-Swap algorithm. Input: initial guess $(I_1^0, I_2^0)$. Output: estimated indexing $(\hat{I}_1, \hat{I}_2)$. Compute $\hat{C}_1^0, \hat{C}_2^0$ induced by $(I_1^0, I_2^0)$; $d^0 = d(\hat{C}_1^0, \hat{C}_2^0)$; $k = 1$; $(\Delta d)^k = 1$; while $(\Delta d)^k > 0$ do: for $s \in 1, \dots, K$ do: for $t \in 1, \dots, K$ do: swap in first group: $\tilde{I}_1 = \bigcup_{p \neq s} I_1^{k-1}(p) \cup I_2^{k-1}(t)$; swap in second group: $\tilde{I}_2 = \bigcup_{q \neq t} I_2^{k-1}(q) \cup I_1^{k-1}(s)$; compute $\tilde{C}_1, \tilde{C}_2$ induced by $(\tilde{I}_1, \tilde{I}_2)$; $D_{s,t} = d(\tilde{C}_1, \tilde{C}_2)$; end for; end for; $(s^*, t^*) = \arg\max_{s,t} D_{s,t}$; $d^k = D_{s^*,t^*}$; $(\Delta d)^k = d^k - d^{k-1}$; $I_1^k = \bigcup_{p \neq s^*} I_1^{k-1}(p) \cup I_2^{k-1}(t^*)$; $I_2^k = \bigcup_{q \neq t^*} I_2^{k-1}(q) \cup I_1^{k-1}(s^*)$; $k = k + 1$; end while; $\hat{d} = d^{k-1}$; $\hat{I}_1 = I_1^{k-1}$; $\hat{I}_2 = I_2^{k-1}$.
Open Source Code	No	The paper does not contain any explicit statement about making the source code available or provide a link to a code repository.
Open Datasets	No	We establish the effectiveness of our algorithm through applications to both synthetic data and a real data set coming from a biomedical context. For synthetic data, the paper describes how the data was generated using specific distributions (e.g., "$X = \rho_x (\cos\theta_x, \sin\theta_x)$, $\rho_x \sim U[-1, 1]$, $\theta_x \sim U[\pi/4, 3\pi/4]$"). For the real data set, it states: "In this subsection we apply our clustering method to a real data set belonging to a biomedical context. In particular, we deal with data produced by functional near-infrared spectroscopy (fNIRS)... The measurement and preprocessing techniques as well as the experimental instruments used to collect data are described in (Torricelli et al., 2014; Zucchelli et al., 2013; Re et al., 2013)." These citations are for the techniques and instruments, not for the dataset itself being publicly available. There is no explicit statement or link for public access to the datasets used.
Dataset Splits	No	The paper describes the composition of the datasets used for clustering, which is an unsupervised task. For multivariate data: "a data set D of N = 400 data, according to the previous laws, made up of K = 200 samples from X and K = 200 samples from Y". For synthetic functional data: "The different sets have been generated choosing $K \in \{20, 25, 30, 35, 40, 45, 50\}$, corresponding to total cardinalities of $N \in \{40, 50, 60, 70, 80, 90, 100\}$". For real data: "a set of N = 30 signals, subdivided into two groups of K = 15". These describe the overall data used in the experiments for clustering, but do not specify train/test/validation splits typically used in supervised machine learning for reproduction.
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies	No	The paper describes mathematical and statistical methods, but it does not specify any particular software or library names with version numbers that were used for implementation or experimentation.
Experiment Setup	Yes	We propose to set J = 1, in order to save computations, and to choose the units to be exchanged by exploring the $K^2$ swaps of one unit from the first group with another unit of the second group. Max-Swap algorithm was run for 10 times, keeping the result for which the objective function was highest. Since the number of data in each sub-population, K, is high with respect to their dimensionality, P = 2, we used Max-Swap algorithm in combination with the standard sample estimator of covariance, S. In what follows we considered L = 30.
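
Since no source code is released, the Max-Swap procedure quoted in the Pseudocode and Experiment Setup rows can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes multivariate data, the standard sample covariance estimator, and the Frobenius norm as the distance d (the rows above do not say which operator distance was used). The function names `cov_distance`, `max_swap`, and `max_swap_restarts` are hypothetical. The sketch uses J = 1 (one unit swapped per group at each iteration) and 10 random restarts, as quoted above.

```python
import numpy as np


def cov_distance(X, idx1, idx2):
    """Frobenius distance between the sample covariances of two groups.

    The Frobenius norm is an assumption made for this sketch.
    """
    C1 = np.cov(X[idx1], rowvar=False)
    C2 = np.cov(X[idx2], rowvar=False)
    return np.linalg.norm(C1 - C2, ord="fro")


def max_swap(X, idx1, idx2):
    """One Max-Swap run with J = 1, starting from the initial guess (idx1, idx2)."""
    idx1, idx2 = np.array(idx1), np.array(idx2)  # copies, inputs stay untouched
    d_best = cov_distance(X, idx1, idx2)
    while True:
        best = (d_best, None, None)
        # Explore all K^2 single swaps between the two groups.
        for s in range(len(idx1)):
            for t in range(len(idx2)):
                idx1[s], idx2[t] = idx2[t], idx1[s]  # trial swap
                d = cov_distance(X, idx1, idx2)
                if d > best[0]:
                    best = (d, s, t)
                idx1[s], idx2[t] = idx2[t], idx1[s]  # undo trial swap
        if best[1] is None:
            # No swap increases the distance: the increment is not positive, stop.
            return idx1, idx2, d_best
        d_best, s, t = best
        idx1[s], idx2[t] = idx2[t], idx1[s]  # commit the best swap


def max_swap_restarts(X, K, n_restarts=10, seed=0):
    """Run Max-Swap from random initial guesses and keep the highest objective."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_restarts):
        perm = rng.permutation(2 * K)
        runs.append(max_swap(X, perm[:K], perm[K:]))
    return max(runs, key=lambda r: r[2])


if __name__ == "__main__":
    # Toy data (an assumption, not the paper's laws): two zero-mean groups
    # that differ only in their covariance structure.
    rng = np.random.default_rng(1)
    K = 20
    A = rng.normal(size=(K, 2)) * [1.0, 0.1]
    B = rng.normal(size=(K, 2)) * [0.1, 1.0]
    X = np.vstack([A, B])
    g1, g2, d = max_swap_restarts(X, K)
    print(sorted(g1.tolist()), sorted(g2.tolist()), round(float(d), 3))
```

Each iteration costs $K^2$ covariance evaluations, so this sketch is only practical for small K. Accepting a swap only when the distance strictly increases mirrors the stopping rule $(\Delta d)^k > 0$ in Algorithm 1.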