Archetypal Analysis++: Rethinking the Initialization Strategy
Authors: Sebastian Mair, Jens Sjölund
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In an extensive empirical evaluation of 15 real-world data sets of varying sizes and dimensionalities and considering two pre-processing strategies, we show that AA++ almost always outperforms all baselines, including the most frequently used ones. |
| Researcher Affiliation | Academia | Sebastian Mair, Uppsala University, Sweden; Jens Sjölund, Uppsala University, Sweden |
| Pseudocode | Yes | The procedure is outlined in Algorithm 1. |
| Open Source Code | Yes | The code is implemented in Python using numpy (Harris et al., 2020) and is publicly available at https://github.com/smair/archetypalanalysis-initialization. |
| Open Datasets | Yes | We use the following seven real-world data sets of varying sizes and dimensionalities. Additional eight real-world data sets, often smaller in size and dimension, are considered in Appendix D. The California Housing (Pace & Barry, 1997) data set... Covertype (Blackard & Dean, 1999)... FMA (Defferrard et al., 2017)... KDD-Protein... Pose is a subset of the Human3.6M data set (Catalin Ionescu, 2011; Ionescu et al., 2014)... RNA (Uzilov et al., 2006)... Million Song Dataset (Bertin-Mahieux et al., 2011)... |
| Dataset Splits | No | The paper does not provide explicit details about training, validation, or test dataset splits. The evaluation of Archetypal Analysis is based on computing the Mean Squared Error (MSE) on the full datasets, which is typical for unsupervised methods that fit models to the entire data rather than using predefined splits for supervised learning tasks. |
| Hardware Specification | Yes | All experiments run on an Intel Xeon machine with 28 cores at 2.60 GHz and 256 GB of memory. |
| Software Dependencies | No | The paper mentions that the code is implemented in Python and uses numpy and scipy (specifically scipy.optimize.nnls), but it does not provide specific version numbers for these software dependencies, which are required for a reproducible description. |
| Experiment Setup | Yes | For various numbers of archetypes k ∈ {15, 25, 50, 75, 100}, we initialize archetypal analysis according to each of the baseline strategies and compute the Mean Squared Error (MSE)... In addition, we perform a fixed number of 30 iterations of archetypal analysis based on those initializations... We compute statistics over 30 seeds, except for larger data sets (n > 500,000 or d > 500) for which we only compute 15 seeds... We apply pre-processing to avoid numerical problems during learning and consider two different approaches: (i) Center And Max Scale... and (ii) Standardization... |
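The two pre-processing strategies quoted in the Experiment Setup row can be sketched in numpy. The exact formulas are elided in the excerpt, so the implementations below are assumptions: "Center And Max Scale" is read as mean-centering followed by division by the global maximum absolute value, and "Standardization" as the usual per-feature zero-mean, unit-variance transform. Function names are ours, not the authors'.

```python
import numpy as np

def center_and_max_scale(X):
    """Mean-center the data, then scale by the global max absolute value.

    One plausible reading of the paper's "Center And Max Scale"; the
    quoted excerpt truncates the definition, so details are assumed.
    """
    Xc = X - X.mean(axis=0)          # center each feature at zero
    return Xc / np.abs(Xc).max()     # all entries end up in [-1, 1]

def standardize(X):
    """Per-feature standardization: zero mean, unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(loc=3.0, scale=2.0, size=(100, 5))

    Xm = center_and_max_scale(X)
    Xs = standardize(X)
    print(np.abs(Xm).max())          # 1.0 by construction
    print(Xs.mean(axis=0).round(6))  # ~0 per feature
    print(Xs.std(axis=0).round(6))   # ~1 per feature
```

Either transform keeps the data on a numerically benign scale before the alternating non-negative least-squares updates (the paper uses `scipy.optimize.nnls`), which is the stated motivation for pre-processing.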