Randomized Dimensionality Reduction for Euclidean Maximization and Diversity Measures

Authors: Jie Gao, Rajesh Jayaram, Benedikt Kolbe, Shay Sapir, Chris Schwiegelshohn, Sandeep Silwal, Erik Waingarten

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theoretical results with an empirical evaluation. Our goal is to convey two key messages: first, randomized dimensionality reduction can be very effective in reducing the runtime of geometric computation in high dimensions. ... The second and the more salient point we demonstrate is that the effects of doubling dimension is an empirically observable phenomenon which can be quantitatively measured. ... Datasets We use two real and one synthetic dataset. ... Optimization Problems We focus our attention to three representative problems among the many that we study. ... For each problem, we compute a solution (using the aforementioned algorithms) in the original dimension. Then we reduce the dimension of our datasets, varying the target dimension, and run the same computation on the projected data. We then compare the values of the two solutions found and measure the relative error between them. In all of our results we plot the average over at least 20 independent trials and one standard deviation error is shaded.
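The project-then-compare pipeline quoted above can be sketched in a few lines. This is a hedged illustration, not the authors' code: it assumes a plain Gaussian (Johnson-Lindenstrauss style) random projection and uses the largest pairwise distance as a stand-in diversity objective; the paper's projections, objectives, and datasets differ.

```python
import math
import random

def gaussian_projection(points, target_dim, rng):
    """Project points to target_dim via a scaled Gaussian random matrix (JL-style).

    Entries are N(0, 1/target_dim), so squared distances are preserved in
    expectation. This is an illustrative choice; the paper may use a different
    random projection.
    """
    d = len(points[0])
    mat = [[rng.gauss(0.0, 1.0) / math.sqrt(target_dim) for _ in range(d)]
           for _ in range(target_dim)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in mat] for p in points]

def max_pairwise_dist(points):
    """Stand-in diversity objective: largest Euclidean distance between any two points."""
    best = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            best = max(best, math.dist(points[i], points[j]))
    return best

def relative_error_curve(points, target_dim, trials=20, seed=0):
    """Average relative error of the objective after projection, over independent trials."""
    rng = random.Random(seed)
    base = max_pairwise_dist(points)
    errs = []
    for _ in range(trials):
        proj = gaussian_projection(points, target_dim, rng)
        errs.append(abs(max_pairwise_dist(proj) - base) / base)
    return sum(errs) / trials
```

Sweeping `target_dim` and plotting the returned averages (with standard-deviation shading over trials) mirrors the experimental protocol the response describes.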
Researcher Affiliation | Collaboration | 1 Department of Computer Science, Rutgers University, Piscataway, NJ, USA; 2 Google Research, NYC, USA; 3 Hausdorff Center for Mathematics, Lamarr Institute for Machine Learning and Artificial Intelligence, University of Bonn, Germany; 4 Department of Computer Science, Weizmann Institute of Science, Israel; 5 Department of Computer Science, Aarhus University, Denmark; 6 Department of Computer Sciences, University of Wisconsin-Madison, USA; 7 University of Pennsylvania, Philadelphia, USA.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. Algorithmic steps are described in the main text or are part of mathematical proofs and derivations.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. It mentions: "We implement our algorithms using Python 3.9.7 on an M1 Macbook Pro with 32GB of RAM.", but this refers to their internal implementation, not an open-source release.
Open Datasets | Yes | Datasets We use two real and one synthetic dataset. ... Dataset 1: MNIST. We select 1000 randomly chosen images from the MNIST dataset ... Dataset 2: CIFAR-embeddings. We use penultimate layer embeddings of pre-trained ResNet models ... This is labeled Cifar-High in our figures.
Dataset Splits | No | The paper mentions using "1000 randomly chosen images from the MNIST dataset" and "1000 randomly chosen embeddings" from CIFAR. For max-matching, it states: "For maximum matching, we experiment with the bipartite version by simply dividing our dataset in half." However, no explicit training/validation/test splits are provided, as would be common for machine learning model training.
Hardware Specification | Yes | We implement our algorithms using Python 3.9.7 on an M1 Macbook Pro with 32GB of RAM.
Software Dependencies | Yes | We implement our algorithms using Python 3.9.7 on an M1 Macbook Pro with 32GB of RAM. ... This allows us to use an efficient exact solver using SciPy (Virtanen et al., 2020).
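The bipartite maximum-matching step can be illustrated with a tiny brute-force solver. This is a sketch only: it enumerates all permutations, which is exponential and fine for toy inputs, whereas the paper reports using an efficient exact SciPy solver; `scipy.optimize.linear_sum_assignment(cost, maximize=True)` solves the same max-weight assignment problem in polynomial time.

```python
import itertools
import math

def max_weight_bipartite_matching(left, right):
    """Brute-force maximum Euclidean bipartite matching between two equal-size
    point sets. Illustration only: O(n!) enumeration, not the paper's solver."""
    n = len(left)
    best_value, best_perm = -1.0, None
    for perm in itertools.permutations(range(n)):
        # Total Euclidean weight of matching left[i] to right[perm[i]].
        value = sum(math.dist(left[i], right[perm[i]]) for i in range(n))
        if value > best_value:
            best_value, best_perm = value, perm
    return best_value, best_perm
```

Splitting a dataset of n points into two halves of n/2 points each, as the paper describes, produces exactly the `left`/`right` inputs this function (or SciPy's assignment solver) expects.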
Experiment Setup | Yes | For each problem, we compute a solution (using the aforementioned algorithms) in the original dimension. Then we reduce the dimension of our datasets, varying the target dimension, and run the same computation on the projected data. We then compare the values of the two solutions found and measure the relative error between them. In all of our results we plot the average over at least 20 independent trials and one standard deviation error is shaded. ... For maximum matching, we experiment with the bipartite version by simply dividing our dataset in half. This allows us to use an efficient exact solver using SciPy (Virtanen et al., 2020). The other two problems correspond to well-studied dataset diversity measures, many known to be NP-Hard to optimize exactly (Erkut, 1990; Chandra & Halldórsson, 2001), so we use a greedy algorithm as a proxy for finding the optimum. ... We set n = 1000 in our experiments. ... We set k = 10 here (see Figure 4 for k = 20).
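Because the diversity objectives are NP-hard to optimize exactly, the paper uses a greedy algorithm as a proxy for the optimum. A standard greedy of this kind is farthest-point selection for max-min dispersion; the sketch below is a hedged illustration of that technique (the paper's exact greedy variants and objectives may differ): it picks k points by repeatedly adding the point whose nearest selected neighbor is farthest away.

```python
import math

def greedy_dispersion(points, k):
    """Farthest-point greedy for max-min dispersion: start from an arbitrary
    point, then repeatedly add the point maximizing the distance to its
    nearest already-selected neighbor."""
    chosen = [points[0]]
    while len(chosen) < k:
        candidates = [p for p in points if p not in chosen]
        nxt = max(candidates,
                  key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

def dispersion(chosen):
    """Max-min dispersion objective: minimum pairwise distance in the selection."""
    return min(math.dist(a, b)
               for i, a in enumerate(chosen) for b in chosen[i + 1:])
```

Running this on the original data and on its low-dimensional projections, then comparing the two `dispersion` values, reproduces the relative-error protocol described above (with k = 10 as in the experiments).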