Orthogonal Random Features: Explicit Forms and Sharp Inequalities
Authors: Nizar Demni, Hachem Kadri
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we provide experimental results on synthetic and real data that corroborate our theoretical findings. We generate synthetic data with dimension $d = 300$ and varying values of the random features $p \in \{10, 50, 100, 150, 200, 250, 300\}$. The data are randomly generated from a normal distribution with zero mean and unit variance. We compute $M_{\mathrm{emp}} := \frac{1}{s}\sum_{l=1}^{s} k_l(x, y)$ and $V_{\mathrm{emp}} := \frac{1}{s}\sum_{l=1}^{s} (k_l(x, y) - M_{\mathrm{emp}})^2$, the empirical bias and variance of $k_{\mathrm{ORF}}$ respectively. Each kernel $k_l$ is computed using a random Haar orthogonal matrix $O_l$, i.e., $k_l(x, y) = \phi_l(x)^\top \phi_l(y)$ where $\phi_l(x) = \frac{1}{\sqrt{p}}\big(\sin(w_1^{l\top} x), \dots, \sin(w_p^{l\top} x), \cos(w_1^{l\top} x), \dots, \cos(w_p^{l\top} x)\big)^\top$ and $w_1^l, \dots, w_p^l$ are the columns of $O_l$. The experiment is repeated 10 times with different random seeds. Figure 1 shows the approximation errors $M_{\mathrm{emp}} - \mathbb{E}[k_{\mathrm{ORF}}(x, y)]$ and $V_{\mathrm{emp}} - \mathbb{V}[k_{\mathrm{ORF}}(x, y)]$ for $s = 50$ and for different values of $p$. We also conduct experiments on real-world datasets to confirm our theoretical findings. The accuracy of the kernel estimation is calculated by measuring the mean squared error (MSE) between the true kernel matrix and the approximated one. |
| Researcher Affiliation | Academia | Nizar Demni (EMAIL), Department of Mathematics, Aix-Marseille University, CNRS, LIS, Marseille, France; Hachem Kadri (EMAIL), Department of Computer Science, Aix-Marseille University, CNRS, LIS, Marseille, France |
| Pseudocode | No | The paper describes mathematical derivations and propositions but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about making code available, nor does it provide links to a code repository. |
| Open Datasets | Yes | Datasets (number of features, number of samples): Ionosphere (34, 351); Ovariancancer (100, 216); Campaign (62, 41,188); Backdoor (196, 95,329). Ionosphere data from the UCI machine learning repository: https://archive.ics.uci.edu/dataset/52/ionosphere. Ovarian cancer data (Conrads et al., 2004): https://fr.mathworks.com/help/stats/sample-data-sets.html. Campaign data is a data set of direct bank marketing campaigns via phone calls (Pang et al., 2019): https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets#numerical-datasets. Backdoor attack detection data extracted from the UNSW-NB 15 dataset (Moustafa & Slay, 2015): https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets#numerical-datasets. |
| Dataset Splits | No | The paper mentions generating synthetic data from a normal distribution and using real-world datasets (Ionosphere, Ovariancancer, Campaign, Backdoor), but does not specify any training/test/validation splits, percentages, or cross-validation methodology for these datasets. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU specifications, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not mention any specific software libraries or tools with their version numbers that were used for the implementation or experiments. |
| Experiment Setup | Yes | We generate synthetic data with dimension $d = 300$ and varying values of the random features $p \in \{10, 50, 100, 150, 200, 250, 300\}$. The data are randomly generated from a normal distribution with zero mean and unit variance. The experiment is repeated 10 times with different random seeds. The Gaussian kernel bandwidth $\sigma$ is set as the average distance between all pairs of data points, i.e., $\sigma = \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} \lVert x_i - x_j \rVert^2}$. The experiment is repeated five times with different random seeds. |
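
Since the paper provides neither pseudocode nor code, the synthetic setup quoted in the table can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, and the Haar orthogonal columns are sampled by Gram-Schmidt on i.i.d. Gaussian vectors (a standard way to obtain Haar-distributed orthonormal frames; the paper does not say which sampler it uses). The sketch computes the feature map $\phi_l$, the estimate $k_l(x, y)$, the empirical quantities $M_{\mathrm{emp}}$ and $V_{\mathrm{emp}}$, and the average-distance bandwidth $\sigma$.

```python
import math
import random


def haar_orthonormal_columns(d, p, rng):
    """Sample p orthonormal vectors in R^d (assumed sampler, not from the paper).

    Gram-Schmidt applied to i.i.d. standard Gaussian vectors yields a frame
    distributed like the first p columns of a Haar orthogonal matrix.
    """
    if p > d:
        raise ValueError("need p <= d")
    basis = []
    for _ in range(p):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Two orthogonalization passes for numerical stability.
        for _ in range(2):
            for u in basis:
                c = sum(a * b for a, b in zip(u, v))
                v = [b - c * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(b * b for b in v))
        basis.append([b / norm for b in v])
    return basis


def feature_map(x, W):
    """phi(x) = p^{-1/2} (sin(w_1.x), ..., sin(w_p.x), cos(w_1.x), ..., cos(w_p.x))."""
    p = len(W)
    proj = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    scale = 1.0 / math.sqrt(p)
    return [scale * math.sin(t) for t in proj] + [scale * math.cos(t) for t in proj]


def kernel_estimate(x, y, W):
    """k(x, y) = phi(x)^T phi(y) for one draw W of orthonormal columns."""
    return sum(a * b for a, b in zip(feature_map(x, W), feature_map(y, W)))


def empirical_mean_and_variance(x, y, p, s, seed=0):
    """Return (M_emp, V_emp) over s independent draws of the orthonormal columns."""
    rng = random.Random(seed)
    d = len(x)
    values = [
        kernel_estimate(x, y, haar_orthonormal_columns(d, p, rng)) for _ in range(s)
    ]
    m_emp = sum(values) / s
    v_emp = sum((k - m_emp) ** 2 for k in values) / s
    return m_emp, v_emp


def average_distance_bandwidth(X):
    """sigma = sqrt((1 / n^2) * sum_{i,j} ||x_i - x_j||^2) for a list of points X."""
    n = len(X)
    total = sum(
        sum((a - b) ** 2 for a, b in zip(xi, xj)) for xi in X for xj in X
    )
    return math.sqrt(total / n ** 2)


if __name__ == "__main__":
    # Small illustrative sizes; the paper uses d = 300 and s = 50.
    rng = random.Random(0)
    d, p, s = 20, 10, 50
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y = [rng.gauss(0.0, 1.0) for _ in range(d)]
    print(empirical_mean_and_variance(x, y, p, s))
```

The sketch uses only the Python standard library so that it stays self-contained; at the paper's scale ($d = 300$, $p$ up to 300) a vectorized linear algebra library would likely be preferable. Comparing $M_{\mathrm{emp}}$ and $V_{\mathrm{emp}}$ against the paper's closed-form expectation and variance is left out, since those formulas are not reproduced in this summary.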