Faster optimal univariate microaggregation

Authors: Felix I. Stamm, Michael T. Schaub

TMLR 2024

Reproducibility assessment. Each entry lists the reproducibility variable, the result, and the LLM response.
Research Type: Experimental
"In experiments we show that the presented algorithms lead to performance improvements on real hardware. We verify that the presented theoretical considerations lead to significant empirical performance improvements in Section 5." Section 5, titled "Experiments", includes "Runtime Experiments" and "Solving Multivariate Microaggregation through projection", with figures showing "Runtime on random data of size 1 million" and "Reconstruction Error for the Multivariate Microaggregation task on real world datasets".
Researcher Affiliation: Academia
Felix I. Stamm, RWTH Aachen University, Germany; Michael T. Schaub, RWTH Aachen University, Germany.
Pseudocode: Yes
Algorithm 1: pseudocode for the staggered algorithm. Algorithm 2: pseudocode for the classical algorithm used to solve least weight subsequence problems. Algorithm 3: the backtrack algorithm, which converts an implicit cluster representation array b into an explicit cluster representation.
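To illustrate the kind of conversion Algorithm 3 performs, here is a minimal sketch of a backtrack step. The encoding is an assumption on my part (the paper defines its own array b): here b[i] is taken to hold the index of the first element of the cluster that ends at position i, so following b backwards from the last position recovers all clusters.

```python
def backtrack(b):
    """Convert an implicit cluster array b into explicit cluster boundaries.

    Assumed (hypothetical) encoding: b[i] is the index of the first element
    of the cluster whose last element is i. Walking backwards from the end
    therefore enumerates the clusters in reverse order.
    """
    clusters = []
    i = len(b) - 1
    while i >= 0:
        start = b[i]
        clusters.append((start, i))  # cluster covers positions start..i inclusive
        i = start - 1                # jump to the end of the previous cluster
    clusters.reverse()
    return clusters


# Example: b encodes two clusters, positions 0..2 and 3..4.
print(backtrack([0, 0, 0, 3, 3]))  # → [(0, 2), (3, 4)]
```

This backward walk runs in time linear in the number of clusters, which is why the implicit representation is sufficient during the dynamic program.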
Open Source Code: Yes
An implementation of the presented ideas is available at https://github.com/Feelx234/microagg1d, and the code is also archived at https://doi.org/10.5281/zenodo.10459327.
Open Datasets: Yes
"We generate our synthetic data set by sampling one million reals uniformly at random between zero and one." The EIA dataset (Brand et al., 2002) consists of 4092 instances and 15 columns (https://github.com/sdcTools/sdcMicro/blob/master/data/EIA.rda). The Tarragona dataset (Brand et al., 2002) has 834 instances and 13 columns (https://github.com/sdcTools/sdcMicro/blob/master/data/Tarragona.rda).
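The synthetic dataset is straightforward to regenerate. A minimal sketch using NumPy; the fixed seed and the sorting step are my additions (the report specifies neither, though univariate microaggregation operates on sorted input):

```python
import numpy as np

# Seed chosen for reproducibility of this sketch only; the paper does not state one.
rng = np.random.default_rng(0)

# One million reals sampled uniformly at random between zero and one,
# sorted because 1-D microaggregation algorithms assume ordered input.
data = np.sort(rng.uniform(0.0, 1.0, size=1_000_000))
```

Any run of a univariate microaggregation solver on this array then measures runtime on the same distribution the paper uses.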
Dataset Splits: No
The paper uses synthetic data generated by sampling (which requires no train/test split in this context) and two real-world datasets (EIA and Tarragona). For the real-world datasets it averages results over 10 runs but specifies no training, validation, or test splits; evaluation reports reconstruction error on the full datasets.
Hardware Specification: No
The paper mentions "On most current hardware" and discusses "performance improvements on real hardware" but gives no specifics about the CPU, GPU, memory, or other hardware used for the experiments.
Software Dependencies: No
"All the methods were implemented in python and compiled with the numba compiler." No version numbers are given for Python or Numba.
Experiment Setup: Yes
"For low values of the minimum group size k the simple dynamic programs are faster than the O(n) algorithms..." For the k-means + cleanup approach, k-means is first run with f · (dataset size / k) clusters, where f is a factor of 0.5, 1, or 2 as indicated in the legend. For the random projection approach, one run consists of making 10, 50, or 100 random projections (as indicated in the legend) and taking their minimum.
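The random-projection baseline described above can be sketched as follows. This is not the paper's implementation: `solve_1d` is a hypothetical stand-in for a univariate optimal microaggregation solver (e.g. one from the paper's repository), and the Gaussian choice of projection directions is my assumption.

```python
import numpy as np

def reconstruction_error(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    err = 0.0
    for lab in np.unique(labels):
        pts = X[labels == lab]
        err += ((pts - pts.mean(axis=0)) ** 2).sum()
    return err

def best_random_projection(X, solve_1d, n_projections=10, seed=0):
    """Project X onto random directions, solve the 1-D problem on each
    projection, and keep the clustering with the smallest reconstruction
    error. `solve_1d` maps a 1-D array to one cluster label per element."""
    rng = np.random.default_rng(seed)
    best_labels, best_err = None, np.inf
    for _ in range(n_projections):
        direction = rng.normal(size=X.shape[1])
        direction /= np.linalg.norm(direction)      # unit-length direction
        labels = solve_1d(X @ direction)            # 1-D subproblem
        err = reconstruction_error(X, labels)       # scored in the full space
        if err < best_err:
            best_err, best_labels = err, labels
    return best_labels, best_err
```

Taking the minimum over 10, 50, or 100 projections, as in the setup above, corresponds to varying `n_projections`.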