Spectral learning of multivariate extremes

Authors: Marco Avella Medina, Richard A. Davis, Gennady Samorodnitsky

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work studies the theoretical performance of spectral clustering based on a random k-nearest neighbor graph constructed from an extremal sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented with numerical experiments illustrating the finite sample performance of our methods.
Researcher Affiliation | Academia | Marco Avella Medina (EMAIL), Department of Statistics, Columbia University, New York, NY, USA; Richard A. Davis (EMAIL), Department of Statistics, Columbia University, New York, NY, USA; Gennady Samorodnitsky (EMAIL), School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA
Pseudocode | Yes | The spectral clustering algorithm of Ng et al. (2002) proceeds as follows: 1. Compute the first m eigenvectors u1, . . . , um of L (i.e., the eigenvectors corresponding to the m smallest eigenvalues of L) and define an Nn × m matrix U using these eigenvectors. 2. Form an Nn × m matrix V by normalizing the rows of U to have unit norm. 3. Treating each of the Nn rows of V as a vector in R^m, cluster them into m clusters C1, . . . , Cm using the K-means algorithm. 4. Assign the original points Xi to cluster Cj if and only if row i of the matrix V was assigned to cluster Cj.
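The four steps quoted above can be sketched in plain NumPy. This is a hypothetical illustration, not the authors' code: the exponential kernel for the weights and the symmetric normalized Laplacian are assumptions about the unspecified W and L, and the K-means step is a minimal loop with deterministic farthest-point initialization.

```python
import numpy as np

def spectral_clustering(X, m, s=1.0, n_iter=50):
    """Sketch of the Ng et al. (2002) steps quoted above (illustrative)."""
    # Weighted adjacency from the kernel exp(-s * ||x - y||) (assumed choice).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.exp(-s * dist)
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Step 1: eigenvectors of the m smallest eigenvalues (eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :m]                                   # Nn x m matrix U

    # Step 2: row-normalize U to unit norm.
    V = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Steps 3-4: K-means on the rows of V, farthest-point initialization.
    centers = [V[0]]
    for _ in range(m - 1):
        gaps = np.min([np.linalg.norm(V - c, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(gaps)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(V[:, None] - centers[None], axis=-1),
                           axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels
```

On two well-separated groups of points, the two smallest eigenvalues of L are near zero and the row-normalized eigenvector matrix collapses each group onto a single direction, so the K-means step separates them cleanly.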
Open Source Code | No | The paper does not provide explicit access to source code for the methodology described. It only mentions a license for the paper itself.
Open Datasets | Yes | We revisit the data analyzed by Heffernan and Tawn (2004) and Janßen and Wan (2020). It is available in the R package texmex and consists of daily measurements of five air pollutants in the city of Leeds, UK.
Dataset Splits | Yes | It was collected between 1994 and 1998, and split into summer and winter months yielding a total of 578 and 532 observations respectively. Following standard practice in multivariate extremes data analysis we standardize the marginal distributions of the data to focus on the extremal dependence. More specifically, we transform the marginals of the original observations Xi as in Janßen and Wan (2020), i.e., we let Yij := 1/{1 − Fnj(Xij)}, where Fnj(x) = (1/n) Σ_{i=1}^n 1(Xij ≤ x) denotes the jth marginal empirical cumulative distribution function, x ∈ R and j = 1, . . . , d. We then proceed to define the extremal observations as the 10% of the transformed observations {Yi} with largest Euclidean norm and analyze their angular components with our algorithm. We consider sample sizes n = {1000, 5000, 25000, 125000} and take a sample of extremes corresponding to observations whose Euclidean norm is larger than or equal to the following vector of corresponding sample quantiles: β = {0.9, 0.96, 0.984, 0.9968}, respectively. These quantiles were chosen to lead to samples of extremes of sizes Nn = {100, 200, 400, 800}, correspondingly.
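The marginal standardization and extraction of angular components quoted in this row can be sketched as follows. The helper name is illustrative, not from the paper, and dividing ranks by n + 1 rather than n is an assumed convention that keeps the transformed values finite (the quote uses the plain empirical cdf).

```python
import numpy as np

def extremal_angles(X, beta=0.9):
    """Rank-transform margins to Yij = 1/(1 - Fnj(Xij)) and return the
    angular parts of the observations above the beta norm quantile."""
    n, d = X.shape
    # Per-column empirical ranks; rank/(n + 1) approximates Fnj(Xij)
    # while avoiding the value 1 (assumed convention).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    Y = 1.0 / (1.0 - ranks / (n + 1.0))

    # Extremal sample: Euclidean norms at or above the beta sample quantile.
    R = np.linalg.norm(Y, axis=1)
    extremes = Y[R >= np.quantile(R, beta)]

    # Angular components: project the extremes onto the unit sphere.
    return extremes / np.linalg.norm(extremes, axis=1, keepdims=True)
```

With beta = 0.9 this keeps the 10% of transformed observations with largest norm, matching the quoted preprocessing.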
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper mentions the use of the 'R package texmex' but does not provide specific version numbers for R or the package.
Experiment Setup | Yes | In all the examples considered below we compute weighted adjacency matrices using the exponential kernel d(x, y) = exp(−s‖x − y‖) with s = 1 and select the number of clusters as suggested by the screeplots of the fully connected weighted adjacency matrices W. It matched nicely the correct number of clusters, when known. We consider sample sizes n = {1000, 5000, 25000, 125000} and take a sample of extremes corresponding to observations whose Euclidean norm is larger than or equal to the following vector of corresponding sample quantiles: β = {0.9, 0.96, 0.984, 0.9968}, respectively. These quantiles were chosen to lead to samples of extremes of sizes Nn = {100, 200, 400, 800}, correspondingly. For these extremes we define kn-nearest neighbor graphs with kn = C log Nn + 1, where the values of the constant are taken from the vector C = {3, 5, 7, 9}. Based on the screeplots, we used 2 clusters in the noiseless case and 3 clusters in the case with noise.
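The graph construction in this row can be sketched as below. This is a hypothetical illustration: rounding kn = C log Nn + 1 up to an integer and symmetrizing the directed nearest-neighbor relation are assumptions the quote leaves open.

```python
import numpy as np

def knn_adjacency(A, C=3.0, s=1.0):
    """kn-nearest neighbor graph on the Nn extremal angles in A,
    weighted by the kernel exp(-s * ||x - y||) (illustrative sketch)."""
    N = len(A)
    # kn = C log Nn + 1, rounded up to an integer (rounding is assumed).
    k = int(np.ceil(C * np.log(N))) + 1
    dist = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    W = np.zeros((N, N))
    for i in range(N):
        # k nearest neighbors of point i, skipping i itself (distance 0).
        nbrs = np.argsort(dist[i])[1:k + 1]
        W[i, nbrs] = np.exp(-s * dist[i, nbrs])
    # Symmetrize: keep an edge if either endpoint selected the other.
    return np.maximum(W, W.T), k
```

For Nn = 100 and C = 3 this gives kn = 15, consistent with the smallest configuration in the quoted setup; the resulting W can then be fed to the spectral clustering steps described in the Pseudocode row.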