Covariance-based Clustering in Multivariate and Functional Data Analysis
Authors: Francesca Ieva, Anna Maria Paganoni, Nicholas Tarabelloni
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish the effectiveness of our algorithm through applications to both synthetic data and a real data set coming from a biomedical context, showing also how the use of shrinkage estimation may lead to substantially better results. Keywords: Clustering, covariance operator, operator distance, shrinkage estimation, functional data analysis. In this section we provide three simulations involving our proposed clustering method. In Subsection 4.1 we show a first example, regarding standard bivariate data, in order to give a clear geometric idea of clustering based on covariance structures. In Subsection 4.2 we show an application to synthetic functional data. In these former two examples the true subdivision of samples is known, so the goodness of the clustering arising from the Max-Swap algorithm is assessed against the true identities of data. In Subsection 4.3, instead, we apply the clustering algorithm on real functional data expressing the concentration of deoxygenated hemoglobin measured in human subjects' brains. |
| Researcher Affiliation | Academia | Francesca Ieva EMAIL, Department of Mathematics "F. Enriques", Università degli Studi di Milano, Via Cesare Saldini 50, 20133 Milano, Italy; Anna Maria Paganoni EMAIL, Nicholas Tarabelloni EMAIL, MOX Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Via Bonardi 9, 20133 Milano, Italy |
| Pseudocode | Yes | The complete formulation of our Max-Swap algorithm is summarised in Algorithm 1, where we specify for the sake of clarity that the symbol $I_1^k(p)$, for instance, indicates the $p$-th element of the set of indexes $I_1^k$. Algorithm 1: Max-Swap algorithm. Input: initial guess $(I_1^0, I_2^0)$. Output: estimated indexing $(\hat I_1, \hat I_2)$. Compute $\hat C_1^0, \hat C_2^0$ induced by $(I_1^0, I_2^0)$; $d^0 = d(\hat C_1^0, \hat C_2^0)$; $k = 1$; $(\Delta d)^k = 1$; while $(\Delta d)^k > 0$ do: for $s \in 1, \ldots, K$ do: for $t \in 1, \ldots, K$ do: swap in first group: $\tilde I_1 = \bigcup_{p \neq s} I_1^{k-1}(p) \cup I_2^{k-1}(t)$; swap in second group: $\tilde I_2 = \bigcup_{q \neq t} I_2^{k-1}(q) \cup I_1^{k-1}(s)$; compute $\tilde C_1, \tilde C_2$ induced by $(\tilde I_1, \tilde I_2)$; $D_{s,t} = d(\tilde C_1, \tilde C_2)$; end for; end for; $(s^*, t^*) = \arg\max_{s,t} D_{s,t}$; $d^k = D_{s^*,t^*}$; $(\Delta d)^k = d^k - d^{k-1}$; $I_1^k = \bigcup_{p \neq s^*} I_1^{k-1}(p) \cup I_2^{k-1}(t^*)$; $I_2^k = \bigcup_{q \neq t^*} I_2^{k-1}(q) \cup I_1^{k-1}(s^*)$; $k = k + 1$; end while; $d = d^{k-1}$; $\hat I_1 = I_1^{k-1}$; $\hat I_2 = I_2^{k-1}$. |
| Open Source Code | No | The paper does not contain any explicit statement about making the source code available or provide a link to a code repository. |
| Open Datasets | No | We establish the effectiveness of our algorithm through applications to both synthetic data and a real data set coming from a biomedical context. For synthetic data, the paper describes how the data was generated using specific distributions (e.g., "$X = \rho_x (\cos\theta_x, \sin\theta_x)$, $\rho_x \sim U[-1, 1]$, $\theta_x \sim U[\pi/4, 3\pi/4]$"). For the real data set, it states: "In this subsection we apply our clustering method to a real data set belonging to a biomedical context. In particular, we deal with data produced by functional near-infrared spectroscopy (fNIRS)... The measurement and preprocessing techniques as well as the experimental instruments used to collect data are described in (Torricelli et al., 2014; Zucchelli et al., 2013; Re et al., 2013)." These citations are for the techniques and instruments, not for the dataset itself being publicly available. There is no explicit statement or link for public access to the datasets used. |
| Dataset Splits | No | The paper describes the composition of the datasets used for clustering, which is an unsupervised task. For multivariate data: "a data set D of N = 400 data, according to the previous laws, made up of K = 200 samples from X and K = 200 samples from Y". For synthetic functional data: "The different sets have been generated choosing $K \in \{20, 25, 30, 35, 40, 45, 50\}$, corresponding to total cardinalities of $N \in \{40, 50, 60, 70, 80, 90, 100\}$". For real data: "a set of N = 30 signals, subdivided into two groups of K = 15". These describe the overall data used in the clustering experiments, but do not specify the train/validation/test splits typically used in supervised machine learning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper describes mathematical and statistical methods, but it does not specify any particular software or library names with version numbers that were used for implementation or experimentation. |
| Experiment Setup | Yes | We propose to set J = 1, in order to save computations, and to choose the units to be exchanged by exploring the $K^2$ swaps of one unit from the first group with another unit of the second group. Max-Swap algorithm was run for 10 times, keeping the result for which the objective function was highest. Since the number of data in each sub-population, K, is high with respect to their dimensionality, P = 2, we used Max-Swap algorithm in combination with the standard sample estimator of covariance, S. In what follows we considered L = 30. |
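
Since the paper releases no code, the following is a minimal, dependency-free sketch of the one-unit swap search (J = 1) described in the Pseudocode and Experiment Setup rows, not the authors' implementation. Several choices are assumptions made for illustration: the distance $d$ is taken to be the Frobenius norm of the difference between the two sample covariance matrices, the function names (`sample_cov`, `cov_distance`, `max_swap`, `run_restarts`) are hypothetical, and random balanced initial guesses are used for the restarts. Unlike Algorithm 1 as quoted, the sketch applies a swap only when it strictly increases the distance, so it returns the last improving configuration.

```python
import math
import random


def sample_cov(rows):
    """Unbiased sample covariance matrix of a list of equal-length vectors."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def cov_distance(data, idx1, idx2):
    """Assumed distance d: Frobenius norm of the covariance difference."""
    c1 = sample_cov([data[i] for i in idx1])
    c2 = sample_cov([data[i] for i in idx2])
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(c1, c2)
                         for x, y in zip(r1, r2)))


def max_swap(data, idx1, idx2):
    """Greedy search over the K^2 one-unit swaps until none increases d."""
    idx1, idx2 = list(idx1), list(idx2)
    d_best = cov_distance(data, idx1, idx2)
    while True:
        best_pair, d_round = None, d_best
        for s in range(len(idx1)):
            for t in range(len(idx2)):
                idx1[s], idx2[t] = idx2[t], idx1[s]  # trial swap
                d = cov_distance(data, idx1, idx2)
                idx1[s], idx2[t] = idx2[t], idx1[s]  # undo trial swap
                if d > d_round:
                    d_round, best_pair = d, (s, t)
        if best_pair is None:  # no swap improves the objective: stop
            return idx1, idx2, d_best
        s, t = best_pair
        idx1[s], idx2[t] = idx2[t], idx1[s]  # commit the best swap
        d_best = d_round


def run_restarts(data, n_restarts=10, seed=0):
    """Run max_swap from random balanced splits and keep the highest distance."""
    rng = random.Random(seed)
    n = len(data)
    k = n // 2
    best = None
    for _ in range(n_restarts):
        perm = list(range(n))
        rng.shuffle(perm)
        result = max_swap(data, perm[:k], perm[k:])
        if best is None or result[2] > best[2]:
            best = result
    return best
```

A call such as `run_restarts(data, n_restarts=10)` on a list of 2-D tuples mirrors the reported setup of 10 runs with the best objective kept. When K is small relative to the dimensionality, the paper's suggestion of a shrinkage estimator would correspond to replacing `sample_cov`, which this sketch does not attempt.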