Randomized Dimensionality Reduction for Euclidean Maximization and Diversity Measures

Authors: Jie Gao, Rajesh Jayaram, Benedikt Kolbe, Shay Sapir, Chris Schwiegelshohn, Sandeep Silwal, Erik Waingarten

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theoretical results with an empirical evaluation. Our goal is to convey two key messages: first, randomized dimensionality reduction can be very effective in reducing the runtime of geometric computation in high dimensions. ... The second and the more salient point we demonstrate is that the effects of doubling dimension is an empirically observable phenomenon which can be quantitatively measured. ... Datasets We use two real and one synthetic dataset. ... Optimization Problems We focus our attention to three representative problems among the many that we study. ... For each problem, we compute a solution (using the aforementioned algorithms) in the original dimension. Then we reduce the dimension of our datasets, varying the target dimension, and run the same computation on the projected data. We then compare the values of the two solutions found and measure the relative error between them. In all of our results we plot the average over at least 20 independent trials and one standard deviation error is shaded.
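The project-then-compare pipeline quoted above can be sketched in a few lines. This is a hedged illustration, not the authors' code: it assumes a plain Gaussian (Johnson-Lindenstrauss style) random projection and uses the largest pairwise distance as a stand-in diversity objective; the paper's projections, objectives, and datasets differ.

```python
import math
import random

def gaussian_projection(points, target_dim, rng):
    """Project points to target_dim via a scaled Gaussian random matrix (JL-style).

    Entries are N(0, 1/target_dim), so squared distances are preserved in
    expectation. This is an illustrative choice; the paper may use a different
    random projection.
    """
    d = len(points[0])
    mat = [[rng.gauss(0.0, 1.0) / math.sqrt(target_dim) for _ in range(d)]
           for _ in range(target_dim)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in mat] for p in points]

def max_pairwise_dist(points):
    """Stand-in diversity objective: largest Euclidean distance between any two points."""
    best = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            best = max(best, math.dist(points[i], points[j]))
    return best

def relative_error_curve(points, target_dim, trials=20, seed=0):
    """Average relative error of the objective after projection, over independent trials."""
    rng = random.Random(seed)
    base = max_pairwise_dist(points)
    errs = []
    for _ in range(trials):
        proj = gaussian_projection(points, target_dim, rng)
        errs.append(abs(max_pairwise_dist(proj) - base) / base)
    return sum(errs) / trials
```

Sweeping `target_dim` and plotting the returned averages (with standard-deviation shading over trials) mirrors the experimental protocol the response describes.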
Researcher Affiliation | Collaboration | 1 Department of Computer Science, Rutgers University, Piscataway, NJ, USA; 2 Google Research, NYC, USA; 3 Hausdorff Center for Mathematics, Lamarr Institute for Machine Learning and Artificial Intelligence, University of Bonn, Germany; 4 Department of Computer Science, Weizmann Institute of Science, Israel; 5 Department of Computer Science, Aarhus University, Denmark; 6 Department of Computer Sciences, University of Wisconsin-Madison, USA; 7 University of Pennsylvania, Philadelphia, USA.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. Algorithmic steps are described in the main text or are part of mathematical proofs and derivations.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. It mentions: "We implement our algorithms using Python 3.9.7 on an M1 Macbook Pro with 32GB of RAM.", but this refers to their internal implementation, not an open-source release.
Open Datasets | Yes | Datasets We use two real and one synthetic dataset. ... Dataset 1: MNIST. We select 1000 randomly chosen images from the MNIST dataset ... Dataset 2: CIFAR-embeddings. We use penultimate layer embeddings of pre-trained ResNet models ... This is labeled Cifar-High in our figures.
Dataset Splits | No | The paper mentions using "1000 randomly chosen images from the MNIST dataset" and "1000 randomly chosen embeddings" from CIFAR. For max-matching, it states: "For maximum matching, we experiment with the bipartite version by simply dividing our dataset in half." However, no explicit training/validation/test splits are provided, as would be common for machine learning model training.
Hardware Specification | Yes | We implement our algorithms using Python 3.9.7 on an M1 Macbook Pro with 32GB of RAM.
Software Dependencies | Yes | We implement our algorithms using Python 3.9.7 on an M1 Macbook Pro with 32GB of RAM. ... This allows us to use an efficient exact solver using SciPy (Virtanen et al., 2020).
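The bipartite maximum-matching step can be illustrated with a tiny brute-force solver. This is a sketch only: it enumerates all permutations, which is exponential and fine for toy inputs, whereas the paper reports using an efficient exact SciPy solver; `scipy.optimize.linear_sum_assignment(cost, maximize=True)` solves the same max-weight assignment problem in polynomial time.

```python
import itertools
import math

def max_weight_bipartite_matching(left, right):
    """Brute-force maximum Euclidean bipartite matching between two equal-size
    point sets. Illustration only: O(n!) enumeration, not the paper's solver."""
    n = len(left)
    best_value, best_perm = -1.0, None
    for perm in itertools.permutations(range(n)):
        # Total Euclidean weight of matching left[i] to right[perm[i]].
        value = sum(math.dist(left[i], right[perm[i]]) for i in range(n))
        if value > best_value:
            best_value, best_perm = value, perm
    return best_value, best_perm
```

Splitting a dataset of n points into two halves of n/2 points each, as the paper describes, produces exactly the `left`/`right` inputs this function (or SciPy's assignment solver) expects.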
Experiment Setup | Yes | For each problem, we compute a solution (using the aforementioned algorithms) in the original dimension. Then we reduce the dimension of our datasets, varying the target dimension, and run the same computation on the projected data. We then compare the values of the two solutions found and measure the relative error between them. In all of our results we plot the average over at least 20 independent trials and one standard deviation error is shaded. ... For maximum matching, we experiment with the bipartite version by simply dividing our dataset in half. This allows us to use an efficient exact solver using SciPy (Virtanen et al., 2020). The other two problems correspond to well-studied dataset diversity measures, many known to be NP-Hard to optimize exactly (Erkut, 1990; Chandra & Halldórsson, 2001), so we use a greedy algorithm as a proxy for finding the optimum. ... We set n = 1000 in our experiments. ... We set k = 10 here (see Figure 4 for k = 20).
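Because the diversity objectives are NP-hard to optimize exactly, the paper uses a greedy algorithm as a proxy for the optimum. A standard greedy of this kind is farthest-point selection for max-min dispersion; the sketch below is a hedged illustration of that technique (the paper's exact greedy variants and objectives may differ): it picks k points by repeatedly adding the point whose nearest selected neighbor is farthest away.

```python
import math

def greedy_dispersion(points, k):
    """Farthest-point greedy for max-min dispersion: start from an arbitrary
    point, then repeatedly add the point maximizing the distance to its
    nearest already-selected neighbor."""
    chosen = [points[0]]
    while len(chosen) < k:
        candidates = [p for p in points if p not in chosen]
        nxt = max(candidates,
                  key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

def dispersion(chosen):
    """Max-min dispersion objective: minimum pairwise distance in the selection."""
    return min(math.dist(a, b)
               for i, a in enumerate(chosen) for b in chosen[i + 1:])
```

Running this on the original data and on its low-dimensional projections, then comparing the two `dispersion` values, reproduces the relative-error protocol described above (with k = 10 as in the experiments).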