Beyond Worst-Case Dimensionality Reduction for Sparse Vectors
Authors: Sandeep Silwal, David Woodruff, Qiuyi (Richard) Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study beyond worst-case dimensionality reduction for s-sparse vectors (vectors with at most s non-zero coordinates). Our work first presents pessimistic lower bounds and then bypasses them by introducing a non-negativity assumption: (a) We first consider average-case guarantees for embedding s-sparse vectors. Here, a well-known folklore upper bound based on the birthday paradox states: For any collection X of s-sparse vectors in R^d, there exists a linear map A : R^d → R^{O(s²)} which exactly preserves the norm of 99% of the vectors in X in any ℓp norm (as opposed to the usual setting where guarantees hold for all vectors). We provide novel lower bounds showing that this is indeed optimal in many settings. Specifically, any oblivious linear map satisfying similar average-case guarantees must map to Ω(s²) dimensions. ... (b) Given these lower bounds, we specialize to sparse non-negative vectors with hopes of improved upper bounds. For a dataset X of non-negative s-sparse vectors and any p ≥ 1, we show that we can non-linearly embed X into O(s log(|X|s)/ε²) dimensions while preserving all pairwise distances in ℓp norm up to 1 ± ε, with no dependence on p. ... Our map also guarantees exact dimensionality reduction for the ℓ∞ norm by embedding X into O(s log |X|) dimensions, which is tight. We further give separation results showing that both the non-linearity of f and the non-negativity of X are necessary, and provide downstream algorithmic improvements using our embedding. |
| Researcher Affiliation | Collaboration | Sandeep Silwal UW-Madison EMAIL, David P. Woodruff CMU and Google EMAIL, Qiuyi Zhang Google DeepMind EMAIL |
| Pseudocode | No | The paper describes algorithms and methods in prose, definitions, and theorems (e.g., Definition 1.1, Definition C.1, Theorem C.2), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not contain any explicit statements about providing open-source code or links to a code repository for the described methodology. |
| Open Datasets | No | The paper primarily presents theoretical results (lower bounds, upper bounds, algorithm design). It does not conduct experiments on specific, publicly available datasets. Figure 1 mentions '10-sparse vectors in R^1000 with non-zero entries chosen uniformly in [0, 1]', which refers to a synthetic data generation process rather than a publicly accessible dataset. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments that involve splitting datasets into training, testing, or validation sets. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental setup that would require hardware specifications. Therefore, no hardware details are provided. |
| Software Dependencies | No | The paper is theoretical and does not describe any experimental setup that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical, focusing on mathematical proofs, lower bounds, and algorithm design. It does not contain an experimental setup with hyperparameters or system-level training settings. |
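The folklore birthday-paradox upper bound quoted above (a linear map A : R^d → R^{O(s²)} that exactly preserves the norm of 99% of s-sparse vectors) can be sketched with a simple hashing construction: hash each of the d coordinates into one of m = O(s²) buckets with a random sign; a vector's norm is preserved exactly whenever no two of its s support coordinates collide. This is a minimal illustration, not the paper's construction; the constant 50 in the bucket count and the Figure-1-style synthetic data (sparse vectors with non-zero entries uniform in [0, 1]) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

d, s, n = 100_000, 10, 1_000   # ambient dimension, sparsity, number of test vectors
m = 50 * s * s                 # O(s^2) buckets: 5,000 << d; constant 50 is illustrative

# Oblivious linear map: each coordinate is hashed to one bucket with a random sign.
bucket = rng.integers(0, m, size=d)
sign = rng.choice([-1.0, 1.0], size=d)

preserved = 0
for _ in range(n):
    # Synthetic s-sparse vector: random support, non-zero entries uniform in [0, 1].
    supp = rng.choice(d, size=s, replace=False)
    vals = rng.uniform(0.0, 1.0, size=s)

    # Apply the map A x: accumulate signed entries into their buckets.
    y = np.zeros(m)
    np.add.at(y, bucket[supp], sign[supp] * vals)

    # If no two support coordinates collided (birthday paradox: probability
    # at least 1 - s^2/(2m) ~ 0.99 here), every ell_p norm is preserved exactly.
    # We check the ell_1 norm as a representative.
    if np.isclose(np.abs(y).sum(), np.abs(vals).sum()):
        preserved += 1

print(preserved / n)  # fraction of vectors whose norm was preserved exactly
```

With s = 10 and m = 5,000 buckets, the per-vector collision probability is about C(10, 2)/5,000 ≈ 0.9%, so roughly 99% of the vectors keep their norm exactly, matching the average-case guarantee described in the abstract.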