Faster optimal univariate microaggregation
Authors: Felix I. Stamm, Michael T. Schaub
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments we show that the presented algorithms lead to performance improvements on real hardware. We verify that the presented theoretical considerations lead to significant empirical performance improvements in Section 5. Section 5, titled 'Experiments', includes 'Runtime Experiments' and 'Solving Multivariate Microaggregation through projection' with figures displaying 'Runtime on random data of size 1 million' and 'Reconstruction Error for the Multivariate Microaggregation task on real world datasets'. |
| Researcher Affiliation | Academia | Felix I. Stamm, RWTH Aachen University, Germany; Michael T. Schaub, RWTH Aachen University, Germany |
| Pseudocode | Yes | Algorithm 1: Pseudo code for the staggered algorithm. Algorithm 2: Pseudo code for the classical algorithm used to solve least weight subsequence problems. Algorithm 3: The backtrack algorithm converts an implicit cluster representation array b into an explicit cluster representation. |
| Open Source Code | Yes | An implementation of the presented ideas is available at https://github.com/Feelx234/microagg1d and the code can also be found at https://doi.org/10.5281/zenodo.10459327. |
| Open Datasets | Yes | We generate our synthetic data set by sampling one million reals uniformly at random between zero and one. The EIA dataset (Brand et al., 2002) consists of 4092 instances and 15 columns... (https://github.com/sdcTools/sdcMicro/blob/master/data/EIA.rda) The Tarragona dataset (Brand et al., 2002) has 834 instances and 13 columns. (https://github.com/sdcTools/sdcMicro/blob/master/data/Tarragona.rda) |
| Dataset Splits | No | The paper uses synthetic data generated by sampling (which doesn't require splits for training/testing in this context) and real-world datasets (EIA and Tarragona). For the real-world datasets, it mentions averaging results over 10 runs but does not specify any explicit training, validation, or test splits. The evaluation focuses on reconstruction error on the given datasets, implying no explicit partitioning into these distinct sets. |
| Hardware Specification | No | The paper mentions 'On most current hardware' and discusses 'performance improvements on real hardware' but does not provide any specific details about the CPU, GPU, memory, or other hardware components used for running the experiments. |
| Software Dependencies | No | All the methods were implemented in python and compiled with the numba compiler. (No specific version numbers are provided for Python or Numba). |
| Experiment Setup | Yes | For low values of the minimum group size k the simple dynamic programs are faster than the O(n) algorithms... For the k-means + cleanup based approach, we first performed k-means where the number of clusters is f · dataset_size/k, where f is a factor of 0.5, 1, or 2 as indicated in the legend. For the random projection approach, one run consists of making 10, 50, or 100 random projections (as indicated in the legend) and taking their minimum. |
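The random-projection baseline quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration only: the grouping routine is a naive fixed-size stand-in for the paper's optimal univariate microaggregation algorithms, and all names (`univariate_microagg_naive`, `sse`, the dataset shape) are illustrative, not taken from the released `microagg1d` code.

```python
import numpy as np

def univariate_microagg_naive(order, k):
    """Naive stand-in for an optimal 1D solver: cut the sorted order into
    consecutive groups of size k (the last group absorbs the remainder).
    The paper's algorithms instead find the SSE-optimal grouping."""
    n = len(order)
    labels = np.empty(n, dtype=int)
    n_groups = n // k
    for g in range(n_groups):
        start = g * k
        end = n if g == n_groups - 1 else start + k
        labels[order[start:end]] = g
    return labels

def sse(X, labels):
    """Within-group sum of squared errors (the reconstruction error)."""
    err = 0.0
    for g in np.unique(labels):
        pts = X[labels == g]
        err += ((pts - pts.mean(axis=0)) ** 2).sum()
    return err

rng = np.random.default_rng(0)
X = rng.random((1000, 5))   # small multivariate stand-in dataset
k = 3                       # minimum group size

best = np.inf
for _ in range(10):         # e.g. 10 random projections, keep the minimum
    d = rng.normal(size=X.shape[1])
    d /= np.linalg.norm(d)            # random unit direction
    order = np.argsort(X @ d)         # sort points along that direction
    labels = univariate_microagg_naive(order, k)
    best = min(best, sse(X, labels))
```

Each projection reduces the multivariate problem to a univariate one that can be solved optimally, and repeating with 10, 50, or 100 random directions and taking the minimum (as in the paper's legend) trades runtime for a better reconstruction error.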