Archetypal Analysis++: Rethinking the Initialization Strategy

Authors: Sebastian Mair, Jens Sjölund

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In an extensive empirical evaluation of 15 real-world data sets of varying sizes and dimensionalities, and considering two pre-processing strategies, we show that AA++ almost always outperforms all baselines, including the most frequently used ones.
Researcher Affiliation | Academia | Sebastian Mair, Uppsala University, Sweden; Jens Sjölund, Uppsala University, Sweden
Pseudocode | Yes | The procedure is outlined in Algorithm 1.
Open Source Code | Yes | The code is implemented in Python using numpy (Harris et al., 2020) and is publicly available at https://github.com/smair/archetypalanalysis-initialization.
Open Datasets | Yes | We use the following seven real-world data sets of varying sizes and dimensionalities; an additional eight real-world data sets, often smaller in size and dimension, are considered in Appendix D: California Housing (Pace & Barry, 1997)... Covertype (Blackard & Dean, 1999)... FMA (Defferrard et al., 2017)... KDD-Protein... Pose, a subset of the Human3.6M data set (Catalin Ionescu, 2011; Ionescu et al., 2014)... RNA (Uzilov et al., 2006)... Million Song Dataset (Bertin-Mahieux et al., 2011)...
Dataset Splits | No | The paper does not provide explicit details about training, validation, or test splits. Archetypal Analysis is evaluated by computing the Mean Squared Error (MSE) on the full data sets, which is typical for unsupervised methods that fit the entire data rather than using predefined splits for supervised learning tasks.
Hardware Specification | Yes | All experiments run on an Intel Xeon machine with 28 cores at 2.60 GHz and 256 GB of memory.
Software Dependencies | No | The paper states that the code is implemented in Python using numpy and scipy (specifically scipy.optimize.nnls), but it does not provide version numbers for these dependencies, which a fully reproducible description requires.
Experiment Setup | Yes | For various numbers of archetypes k ∈ {15, 25, 50, 75, 100}, we initialize archetypal analysis according to each of the baseline strategies and compute the Mean Squared Error (MSE)... In addition, we perform a fixed number of 30 iterations of archetypal analysis based on those initializations... We compute statistics over 30 seeds, except for larger data sets (n > 500,000 or d > 500), for which we only compute 15 seeds... We apply pre-processing to avoid numerical problems during learning and consider two different approaches: (i) Center And Max Scale... and (ii) Standardization...
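The two pre-processing strategies named in the setup can be sketched in numpy (the library the paper's code uses). The authors' exact scaling convention is not reproduced in this summary, so the center-and-max-scale variant below assumes a single global maximum after centering:

```python
import numpy as np

def center_and_max_scale(X):
    """Strategy (i), Center And Max Scale: subtract the per-feature
    mean, then divide by the largest absolute value (a global max
    is an assumption; the paper may scale per feature)."""
    Xc = X - X.mean(axis=0)
    return Xc / np.abs(Xc).max()

def standardize(X):
    """Strategy (ii), Standardization: zero mean and unit variance
    per feature (z-scoring)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Either transform maps the data into a numerically benign range before the archetypal-analysis iterations begin.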
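Algorithm 1 itself is not reproduced in this summary. As context for what an AA++-style initialization does, the method adapts the k-means++ idea of seeding with points sampled proportionally to a squared-distance criterion. The sketch below is only a generic D²-style seeding over data points, not the paper's exact criterion:

```python
import numpy as np

def dsquared_seeding(X, k, seed=None):
    """Illustrative k-means++-style seeding: each new point is drawn
    with probability proportional to its squared distance to the
    closest point selected so far. AA++'s actual Algorithm 1 uses a
    criterion tailored to archetypal analysis; this is an analogue."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = [int(rng.integers(n))]                  # first point uniform
    d2 = np.sum((X - X[idx[0]]) ** 2, axis=1)     # squared distances
    for _ in range(k - 1):
        j = int(rng.choice(n, p=d2 / d2.sum()))   # D^2-proportional draw
        idx.append(j)
        d2 = np.minimum(d2, np.sum((X - X[j]) ** 2, axis=1))
    return np.array(idx)
```

Because already-selected points have zero distance to themselves, they are never re-drawn, and the selected indices spread out over the data.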
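The MSE used for evaluation is the reconstruction error of the fitted model. In the standard archetypal-analysis formulation, the archetypes Z = BX are convex combinations of the data points and each point is reconstructed as a convex combination AZ of archetypes; a minimal sketch, assuming a per-entry normalization (the paper's exact normalization is not stated in this summary):

```python
import numpy as np

def aa_mse(X, A, B):
    """Mean squared reconstruction error of archetypal analysis,
    X ~ A @ (B @ X). A is (n, k) with convex rows, B is (k, n) with
    convex rows; per-entry averaging is an assumption here."""
    Z = B @ X          # (k, d) archetypes as convex data combinations
    R = X - A @ Z      # (n, d) residuals
    return float(np.mean(R ** 2))
```

With k = n and A = B = I the reconstruction is exact, so the MSE is zero; in practice k is small and the MSE measures how well the convex hull of the archetypes covers the data.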