Faster optimal univariate microaggregation

Authors: Felix I. Stamm, Michael T. Schaub

TMLR 2024

Reproducibility assessment. Each entry lists the reproducibility variable, the result, and the LLM response.
Research Type: Experimental
"In experiments we show that the presented algorithms lead to performance improvements on real hardware. We verify that the presented theoretical considerations lead to significant empirical performance improvements in Section 5." Section 5, titled "Experiments", includes "Runtime Experiments" and "Solving Multivariate Microaggregation through projection", with figures showing "Runtime on random data of size 1 million" and "Reconstruction Error for the Multivariate Microaggregation task on real world datasets".
Researcher Affiliation: Academia
Felix I. Stamm, RWTH Aachen University, Germany; Michael T. Schaub, RWTH Aachen University, Germany.
Pseudocode: Yes
Algorithm 1: pseudocode for the staggered algorithm. Algorithm 2: pseudocode for the classical algorithm used to solve least weight subsequence problems. Algorithm 3: the backtrack algorithm, which converts an implicit cluster representation array b into an explicit cluster representation.
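To illustrate the kind of conversion Algorithm 3 performs, here is a minimal sketch of a backtrack step. The encoding is an assumption on my part (the paper defines its own array b): here b[i] is taken to hold the index of the first element of the cluster that ends at position i, so following b backwards from the last position recovers all clusters.

```python
def backtrack(b):
    """Convert an implicit cluster array b into explicit cluster boundaries.

    Assumed (hypothetical) encoding: b[i] is the index of the first element
    of the cluster whose last element is i. Walking backwards from the end
    therefore enumerates the clusters in reverse order.
    """
    clusters = []
    i = len(b) - 1
    while i >= 0:
        start = b[i]
        clusters.append((start, i))  # cluster covers positions start..i inclusive
        i = start - 1                # jump to the end of the previous cluster
    clusters.reverse()
    return clusters


# Example: b encodes two clusters, positions 0..2 and 3..4.
print(backtrack([0, 0, 0, 3, 3]))  # → [(0, 2), (3, 4)]
```

This backward walk runs in time linear in the number of clusters, which is why the implicit representation is sufficient during the dynamic program.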
Open Source Code: Yes
An implementation of the presented ideas is available at https://github.com/Feelx234/microagg1d, and the code is also archived at https://doi.org/10.5281/zenodo.10459327.
Open Datasets: Yes
"We generate our synthetic data set by sampling one million reals uniformly at random between zero and one." The EIA dataset (Brand et al., 2002) consists of 4092 instances and 15 columns (https://github.com/sdcTools/sdcMicro/blob/master/data/EIA.rda). The Tarragona dataset (Brand et al., 2002) has 834 instances and 13 columns (https://github.com/sdcTools/sdcMicro/blob/master/data/Tarragona.rda).
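The synthetic dataset is straightforward to regenerate. A minimal sketch using NumPy; the fixed seed and the sorting step are my additions (the report specifies neither, though univariate microaggregation operates on sorted input):

```python
import numpy as np

# Seed chosen for reproducibility of this sketch only; the paper does not state one.
rng = np.random.default_rng(0)

# One million reals sampled uniformly at random between zero and one,
# sorted because 1-D microaggregation algorithms assume ordered input.
data = np.sort(rng.uniform(0.0, 1.0, size=1_000_000))
```

Any run of a univariate microaggregation solver on this array then measures runtime on the same distribution the paper uses.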
Dataset Splits: No
The paper uses synthetic data generated by sampling (which requires no train/test split in this context) and two real-world datasets (EIA and Tarragona). For the real-world datasets it averages results over 10 runs but specifies no training, validation, or test splits; evaluation reports reconstruction error on the full datasets.
Hardware Specification: No
The paper mentions "On most current hardware" and discusses "performance improvements on real hardware" but gives no specifics about the CPU, GPU, memory, or other hardware used for the experiments.
Software Dependencies: No
"All the methods were implemented in python and compiled with the numba compiler." No version numbers are given for Python or Numba.
Experiment Setup: Yes
"For low values of the minimum group size k the simple dynamic programs are faster than the O(n) algorithms..." For the k-means + cleanup approach, k-means is first run with f · (dataset size / k) clusters, where f is a factor of 0.5, 1, or 2 as indicated in the legend. For the random projection approach, one run consists of making 10, 50, or 100 random projections (as indicated in the legend) and taking their minimum.
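The random-projection baseline described above can be sketched as follows. This is not the paper's implementation: `solve_1d` is a hypothetical stand-in for a univariate optimal microaggregation solver (e.g. one from the paper's repository), and the Gaussian choice of projection directions is my assumption.

```python
import numpy as np

def reconstruction_error(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    err = 0.0
    for lab in np.unique(labels):
        pts = X[labels == lab]
        err += ((pts - pts.mean(axis=0)) ** 2).sum()
    return err

def best_random_projection(X, solve_1d, n_projections=10, seed=0):
    """Project X onto random directions, solve the 1-D problem on each
    projection, and keep the clustering with the smallest reconstruction
    error. `solve_1d` maps a 1-D array to one cluster label per element."""
    rng = np.random.default_rng(seed)
    best_labels, best_err = None, np.inf
    for _ in range(n_projections):
        direction = rng.normal(size=X.shape[1])
        direction /= np.linalg.norm(direction)      # unit-length direction
        labels = solve_1d(X @ direction)            # 1-D subproblem
        err = reconstruction_error(X, labels)       # scored in the full space
        if err < best_err:
            best_err, best_labels = err, labels
    return best_labels, best_err
```

Taking the minimum over 10, 50, or 100 projections, as in the setup above, corresponds to varying `n_projections`.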