Beyond Worst-Case Dimensionality Reduction for Sparse Vectors
Authors: Sandeep Silwal, David Woodruff, Qiuyi (Richard) Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study beyond worst-case dimensionality reduction for s-sparse vectors (vectors with at most s non-zero coordinates). Our work first presents pessimistic lower bounds and then bypasses them by introducing a non-negativity assumption: (a) We first consider average-case guarantees for embedding s-sparse vectors. Here, a well-known folklore upper bound based on the birthday paradox states: For any collection X of s-sparse vectors in R^d, there exists a linear map A : R^d → R^{O(s²)} which exactly preserves the norm of 99% of the vectors in X in any ℓp norm (as opposed to the usual setting where guarantees hold for all vectors). We provide novel lower bounds showing that this is indeed optimal in many settings. Specifically, any oblivious linear map satisfying similar average-case guarantees must map to Ω(s²) dimensions. ... (b) Given these lower bounds, we specialize to sparse non-negative vectors with hopes of improved upper bounds. For a dataset X of non-negative s-sparse vectors and any p ≥ 1, we show that we can non-linearly embed X into O(s log(|X|s)/ε²) dimensions while preserving all pairwise distances in ℓp norm up to 1 ± ε, with no dependence on p. ... Our map also guarantees exact dimensionality reduction for the ℓ∞ norm by embedding X into O(s log |X|) dimensions, which is tight. We further give separation results showing that both the non-linearity of f and the non-negativity of X are necessary, and provide downstream algorithmic improvements using our embedding. |
| Researcher Affiliation | Collaboration | Sandeep Silwal UW-Madison EMAIL, David P. Woodruff CMU and Google EMAIL, Qiuyi Zhang Google DeepMind EMAIL |
| Pseudocode | No | The paper describes algorithms and methods in prose, definitions, and theorems (e.g., Definition 1.1, Definition C.1, Theorem C.2), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not contain any explicit statements about providing open-source code or links to a code repository for the described methodology. |
| Open Datasets | No | The paper primarily presents theoretical results (lower bounds, upper bounds, algorithm design). It does not conduct experiments on specific, publicly available datasets. Figure 1 mentions '10-sparse vectors in R^1000 with non-zero entries chosen uniformly in [0, 1]', which refers to a synthetic data generation process rather than a publicly accessible dataset. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments that involve splitting datasets into training, testing, or validation sets. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental setup that would require hardware specifications. Therefore, no hardware details are provided. |
| Software Dependencies | No | The paper is theoretical and does not describe any experimental setup that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical, focusing on mathematical proofs, lower bounds, and algorithm design. It does not contain an experimental setup with hyperparameters or system-level training settings. |
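The folklore birthday-paradox upper bound quoted above (a linear map A : R^d → R^{O(s²)} that exactly preserves the norm of 99% of s-sparse vectors) can be sketched with a simple hashing construction: hash each of the d coordinates into one of m = O(s²) buckets with a random sign; a vector's norm is preserved exactly whenever no two of its s support coordinates collide. This is a minimal illustration, not the paper's construction; the constant 50 in the bucket count and the Figure-1-style synthetic data (sparse vectors with non-zero entries uniform in [0, 1]) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

d, s, n = 100_000, 10, 1_000   # ambient dimension, sparsity, number of test vectors
m = 50 * s * s                 # O(s^2) buckets: 5,000 << d; constant 50 is illustrative

# Oblivious linear map: each coordinate is hashed to one bucket with a random sign.
bucket = rng.integers(0, m, size=d)
sign = rng.choice([-1.0, 1.0], size=d)

preserved = 0
for _ in range(n):
    # Synthetic s-sparse vector: random support, non-zero entries uniform in [0, 1].
    supp = rng.choice(d, size=s, replace=False)
    vals = rng.uniform(0.0, 1.0, size=s)

    # Apply the map A x: accumulate signed entries into their buckets.
    y = np.zeros(m)
    np.add.at(y, bucket[supp], sign[supp] * vals)

    # If no two support coordinates collided (birthday paradox: probability
    # at least 1 - s^2/(2m) ~ 0.99 here), every ell_p norm is preserved exactly.
    # We check the ell_1 norm as a representative.
    if np.isclose(np.abs(y).sum(), np.abs(vals).sum()):
        preserved += 1

print(preserved / n)  # fraction of vectors whose norm was preserved exactly
```

With s = 10 and m = 5,000 buckets, the per-vector collision probability is about C(10, 2)/5,000 ≈ 0.9%, so roughly 99% of the vectors keep their norm exactly, matching the average-case guarantee described in the abstract.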