Archetypal Analysis++: Rethinking the Initialization Strategy

Authors: Sebastian Mair, Jens Sjölund

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In an extensive empirical evaluation of 15 real-world data sets of varying sizes and dimensionalities, and considering two pre-processing strategies, we show that AA++ almost always outperforms all baselines, including the most frequently used ones.
Researcher Affiliation | Academia | Sebastian Mair, Uppsala University, Sweden; Jens Sjölund, Uppsala University, Sweden
Pseudocode | Yes | The procedure is outlined in Algorithm 1.
Open Source Code | Yes | The code is implemented in Python using numpy (Harris et al., 2020) and is publicly available at https://github.com/smair/archetypalanalysis-initialization.
Open Datasets | Yes | We use the following seven real-world data sets of varying sizes and dimensionalities; an additional eight real-world data sets, often smaller in size and dimension, are considered in Appendix D: California Housing (Pace & Barry, 1997)... Covertype (Blackard & Dean, 1999)... FMA (Defferrard et al., 2017)... KDD-Protein... Pose, a subset of the Human3.6M data set (Catalin Ionescu, 2011; Ionescu et al., 2014)... RNA (Uzilov et al., 2006)... Million Song Dataset (Bertin-Mahieux et al., 2011)...
Dataset Splits | No | The paper does not provide explicit details about training, validation, or test splits. Archetypal Analysis is evaluated by computing the Mean Squared Error (MSE) on the full data sets, which is typical for unsupervised methods that fit the entire data rather than using predefined splits for supervised learning tasks.
Hardware Specification | Yes | All experiments run on an Intel Xeon machine with 28 cores at 2.60 GHz and 256 GB of memory.
Software Dependencies | No | The paper states that the code is implemented in Python using numpy and scipy (specifically scipy.optimize.nnls), but it does not provide version numbers for these dependencies, which a fully reproducible description requires.
Experiment Setup | Yes | For various numbers of archetypes k ∈ {15, 25, 50, 75, 100}, we initialize archetypal analysis according to each of the baseline strategies and compute the Mean Squared Error (MSE)... In addition, we perform a fixed number of 30 iterations of archetypal analysis based on those initializations... We compute statistics over 30 seeds, except for larger data sets (n > 500,000 or d > 500), for which we only compute 15 seeds... We apply pre-processing to avoid numerical problems during learning and consider two different approaches: (i) Center And Max Scale... and (ii) Standardization...
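The two pre-processing strategies named in the setup can be sketched in numpy (the library the paper's code uses). The authors' exact scaling convention is not reproduced in this summary, so the center-and-max-scale variant below assumes a single global maximum after centering:

```python
import numpy as np

def center_and_max_scale(X):
    """Strategy (i), Center And Max Scale: subtract the per-feature
    mean, then divide by the largest absolute value (a global max
    is an assumption; the paper may scale per feature)."""
    Xc = X - X.mean(axis=0)
    return Xc / np.abs(Xc).max()

def standardize(X):
    """Strategy (ii), Standardization: zero mean and unit variance
    per feature (z-scoring)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Either transform maps the data into a numerically benign range before the archetypal-analysis iterations begin.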
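Algorithm 1 itself is not reproduced in this summary. As context for what an AA++-style initialization does, the method adapts the k-means++ idea of seeding with points sampled proportionally to a squared-distance criterion. The sketch below is only a generic D²-style seeding over data points, not the paper's exact criterion:

```python
import numpy as np

def dsquared_seeding(X, k, seed=None):
    """Illustrative k-means++-style seeding: each new point is drawn
    with probability proportional to its squared distance to the
    closest point selected so far. AA++'s actual Algorithm 1 uses a
    criterion tailored to archetypal analysis; this is an analogue."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = [int(rng.integers(n))]                  # first point uniform
    d2 = np.sum((X - X[idx[0]]) ** 2, axis=1)     # squared distances
    for _ in range(k - 1):
        j = int(rng.choice(n, p=d2 / d2.sum()))   # D^2-proportional draw
        idx.append(j)
        d2 = np.minimum(d2, np.sum((X - X[j]) ** 2, axis=1))
    return np.array(idx)
```

Because already-selected points have zero distance to themselves, they are never re-drawn, and the selected indices spread out over the data.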
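The MSE used for evaluation is the reconstruction error of the fitted model. In the standard archetypal-analysis formulation, the archetypes Z = BX are convex combinations of the data points and each point is reconstructed as a convex combination AZ of archetypes; a minimal sketch, assuming a per-entry normalization (the paper's exact normalization is not stated in this summary):

```python
import numpy as np

def aa_mse(X, A, B):
    """Mean squared reconstruction error of archetypal analysis,
    X ~ A @ (B @ X). A is (n, k) with convex rows, B is (k, n) with
    convex rows; per-entry averaging is an assumption here."""
    Z = B @ X          # (k, d) archetypes as convex data combinations
    R = X - A @ Z      # (n, d) residuals
    return float(np.mean(R ** 2))
```

With k = n and A = B = I the reconstruction is exact, so the MSE is zero; in practice k is small and the MSE measures how well the convex hull of the archetypes covers the data.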