Spectral learning of multivariate extremes

Authors: Marco Avella Medina, Richard A. Davis, Gennady Samorodnitsky

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work studies the theoretical performance of spectral clustering based on a random k-nearest neighbor graph constructed from an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented with numerical experiments illustrating the finite sample performance of our methods.
Researcher Affiliation | Academia | Marco Avella Medina (EMAIL), Department of Statistics, Columbia University, New York, NY, USA; Richard A. Davis (EMAIL), Department of Statistics, Columbia University, New York, NY, USA; Gennady Samorodnitsky (EMAIL), School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA
Pseudocode | Yes | The spectral clustering algorithm of Ng et al. (2002) proceeds as follows: 1. Compute the first m eigenvectors u1, . . . , um of L (i.e., the eigenvectors corresponding to the m smallest eigenvalues of L) and define an Nn × m matrix U using these eigenvectors. 2. Form an Nn × m matrix V by normalizing the rows of U to have unit norm. 3. Treating each of the Nn rows of V as a vector in R^m, cluster them into m clusters C1, . . . , Cm using the K-means algorithm. 4. Assign the original points Xi to cluster Cj if and only if row i of the matrix V was assigned to cluster Cj.
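The four steps quoted above can be sketched in plain NumPy. This is a hypothetical illustration, not the authors' code: the exponential kernel for the weights and the symmetric normalized Laplacian are assumptions about the unspecified W and L, and the K-means step is a minimal loop with deterministic farthest-point initialization.

```python
import numpy as np

def spectral_clustering(X, m, s=1.0, n_iter=50):
    """Sketch of the Ng et al. (2002) steps quoted above (illustrative)."""
    # Weighted adjacency from the kernel exp(-s * ||x - y||) (assumed choice).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.exp(-s * dist)
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Step 1: eigenvectors of the m smallest eigenvalues (eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :m]                                   # Nn x m matrix U

    # Step 2: row-normalize U to unit norm.
    V = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Steps 3-4: K-means on the rows of V, farthest-point initialization.
    centers = [V[0]]
    for _ in range(m - 1):
        gaps = np.min([np.linalg.norm(V - c, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(gaps)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(V[:, None] - centers[None], axis=-1),
                           axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels
```

On two well-separated groups of points, the two smallest eigenvalues of L are near zero and the row-normalized eigenvector matrix collapses each group onto a single direction, so the K-means step separates them cleanly.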
Open Source Code | No | The paper does not provide explicit access to source code for the methodology described. It only mentions a license for the paper itself.
Open Datasets | Yes | We revisit the data analyzed by Heffernan and Tawn (2004) and Janßen and Wan (2020). It is available in the R package texmex and consists of daily measurements of five air pollutants in the city of Leeds, UK.
Dataset Splits | Yes | It was collected between 1994 and 1998, and split into summer and winter months yielding a total of 578 and 532 observations respectively. Following standard practice in multivariate extremes data analysis we standardize the marginal distributions of the data to focus on the extremal dependence. More specifically, we transform the marginals of the original observations Xi as in Janßen and Wan (2020), i.e., we let Yij := 1/{1 − Fnj(Xij)}, where Fnj(x) = (1/n) Σ_{i=1}^n 1(Xij ≤ x) denotes the jth marginal empirical cumulative distribution function, x ∈ R and j = 1, . . . , d. We then proceed to define the extremal observations as the 10% of the transformed observations {Yi} with largest Euclidean norm and analyze their angular components with our algorithm. We consider sample sizes n = {1000, 5000, 25000, 125000} and take a sample of extremes corresponding to observations whose Euclidean norm is larger than or equal to the following vector of corresponding sample quantiles: β = {0.9, 0.96, 0.984, 0.9968}, respectively. These quantiles were chosen to lead to samples of extremes of sizes Nn = {100, 200, 400, 800}, correspondingly.
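The marginal standardization and extraction of angular components quoted in this row can be sketched as follows. The helper name is illustrative, not from the paper, and dividing ranks by n + 1 rather than n is an assumed convention that keeps the transformed values finite (the quote uses the plain empirical cdf).

```python
import numpy as np

def extremal_angles(X, beta=0.9):
    """Rank-transform margins to Yij = 1/(1 - Fnj(Xij)) and return the
    angular parts of the observations above the beta norm quantile."""
    n, d = X.shape
    # Per-column empirical ranks; rank/(n + 1) approximates Fnj(Xij)
    # while avoiding the value 1 (assumed convention).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    Y = 1.0 / (1.0 - ranks / (n + 1.0))

    # Extremal sample: Euclidean norms at or above the beta sample quantile.
    R = np.linalg.norm(Y, axis=1)
    extremes = Y[R >= np.quantile(R, beta)]

    # Angular components: project the extremes onto the unit sphere.
    return extremes / np.linalg.norm(extremes, axis=1, keepdims=True)
```

With beta = 0.9 this keeps the 10% of transformed observations with largest norm, matching the quoted preprocessing.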
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper mentions the use of the 'R package texmex' but does not provide specific version numbers for R or the package.
Experiment Setup | Yes | In all the examples considered below we compute weighted adjacency matrices using the exponential kernel d(x, y) = exp(−s‖x − y‖) with s = 1 and select the number of clusters as suggested by the screeplots of the fully connected weighted adjacency matrices W. It matched nicely the correct number of clusters, when known. We consider sample sizes n = {1000, 5000, 25000, 125000} and take a sample of extremes corresponding to observations whose Euclidean norm is larger than or equal to the following vector of corresponding sample quantiles: β = {0.9, 0.96, 0.984, 0.9968}, respectively. These quantiles were chosen to lead to samples of extremes of sizes Nn = {100, 200, 400, 800}, correspondingly. For these extremes we define kn-nearest neighbor graphs with kn = C log Nn + 1, where the values of the constant are taken from the vector C = {3, 5, 7, 9}. Based on the screeplots, we used 2 clusters in the noiseless case and 3 clusters in the case with noise.
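The graph construction in this row can be sketched as below. This is a hypothetical illustration: rounding kn = C log Nn + 1 up to an integer and symmetrizing the directed nearest-neighbor relation are assumptions the quote leaves open.

```python
import numpy as np

def knn_adjacency(A, C=3.0, s=1.0):
    """kn-nearest neighbor graph on the Nn extremal angles in A,
    weighted by the kernel exp(-s * ||x - y||) (illustrative sketch)."""
    N = len(A)
    # kn = C log Nn + 1, rounded up to an integer (rounding is assumed).
    k = int(np.ceil(C * np.log(N))) + 1
    dist = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    W = np.zeros((N, N))
    for i in range(N):
        # k nearest neighbors of point i, skipping i itself (distance 0).
        nbrs = np.argsort(dist[i])[1:k + 1]
        W[i, nbrs] = np.exp(-s * dist[i, nbrs])
    # Symmetrize: keep an edge if either endpoint selected the other.
    return np.maximum(W, W.T), k
```

For Nn = 100 and C = 3 this gives kn = 15, consistent with the smallest configuration in the quoted setup; the resulting W can then be fed to the spectral clustering steps described in the Pseudocode row.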