reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Orthogonal Random Features: Explicit Forms and Sharp Inequalities

Authors: Nizar Demni, Hachem Kadri

TMLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section we provide experimental results on synthetic and real data that corroborate our theoretical findings. We generate synthetic data with dimension d = 300 and varying values of the random features p = {10, 50, 100, 150, 200, 250, 300}. The data are randomly generated from a normal distribution with zero mean and unit variance. We compute $M_{\mathrm{emp}} := \frac{1}{s}\sum_{l=1}^{s} k_l(x, y)$ and $V_{\mathrm{emp}} := \frac{1}{s}\sum_{l=1}^{s} (k_l(x, y) - M_{\mathrm{emp}})^2$, the empirical bias and variance of $k_{\mathrm{ORF}}$ respectively. Each kernel $k_l$ is computed using a random Haar orthogonal matrix $O_l$, i.e., $k_l(x, y) = \phi_l(x)^\top \phi_l(y)$ where $\phi_l(x) = \frac{1}{\sqrt{p}}\left(\sin(w_1^{l\top} x), \ldots, \sin(w_p^{l\top} x), \cos(w_1^{l\top} x), \ldots, \cos(w_p^{l\top} x)\right)^\top$ and $w_1^l, \ldots, w_p^l$ are the columns of $O_l$. The experiment is repeated 10 times with different random seeds. Figure 1 shows the approximation errors $M_{\mathrm{emp}} - \mathbb{E}[k_{\mathrm{ORF}}(x, y)]$ and $V_{\mathrm{emp}} - \mathbb{V}[k_{\mathrm{ORF}}(x, y)]$ for s = 50 and for different values of p. We also conduct experiments on real-world datasets to confirm our theoretical findings. The accuracy of the kernel estimation is calculated by measuring the mean squared error (MSE) between the true kernel matrix and the approximated one.
Researcher Affiliation	Academia	Nizar Demni (EMAIL), Department of Mathematics, Aix-Marseille University, CNRS, LIS, Marseille, France; Hachem Kadri (EMAIL), Department of Computer Science, Aix-Marseille University, CNRS, LIS, Marseille, France
Pseudocode	No	The paper describes mathematical derivations and propositions but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper does not contain any explicit statements about making code available, nor does it provide links to a code repository.
Open Datasets	Yes	Datasets (dimension, number of samples): Ionosphere (34, 351); Ovariancancer (100, 216); Campaign (62, 41,188); Backdoor (196, 95,329). Ionosphere data from the UCI machine learning repository: https://archive.ics.uci.edu/dataset/52/ionosphere. Ovarian cancer data (Conrads et al., 2004): https://fr.mathworks.com/help/stats/sample-data-sets.html. Campaign data is a data set of direct bank marketing campaigns via phone calls (Pang et al., 2019): https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets#numerical-datasets. Backdoor attack detection data extracted from the UNSW-NB 15 dataset (Moustafa & Slay, 2015): https://github.com/GuansongPang/ADRepository-Anomaly-detection-datasets#numerical-datasets.
Dataset Splits	No	The paper mentions generating synthetic data from a normal distribution and using real-world datasets (Ionosphere, Ovariancancer, Campaign, Backdoor), but does not specify any training/test/validation splits, percentages, or cross-validation methodology for these datasets.
Hardware Specification	No	The paper does not provide any specific hardware details such as GPU models, CPU specifications, or memory used for running the experiments.
Software Dependencies	No	The paper does not mention any specific software libraries or tools with their version numbers that were used for the implementation or experiments.
Experiment Setup	Yes	We generate synthetic data with dimension d = 300 and varying values of the random features p = {10, 50, 100, 150, 200, 250, 300}. The data are randomly generated from a normal distribution with zero mean and unit variance. The experiment is repeated 10 times with different random seeds. The Gaussian kernel bandwidth σ is set as the average distance between all pairs of data points, i.e., $\sigma = \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} \|x_i - x_j\|^2}$. The experiment is repeated five times with different random seeds.
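
Since the report records that the paper provides neither pseudocode nor code, the following is a minimal NumPy sketch of the synthetic experiment and the bandwidth rule quoted in the table. It is not the authors' implementation. All function names are hypothetical. Sampling the Haar matrix through a sign-corrected QR factorization, taking the first p columns of each matrix as w_1, ..., w_p, and the 1/sqrt(p) normalization of the feature map are assumptions made for illustration.

```python
import numpy as np


def haar_orthogonal(d, rng):
    """Draw a d x d Haar-distributed orthogonal matrix (QR of a Gaussian matrix)."""
    z = rng.standard_normal((d, d))
    q, r = np.linalg.qr(z)
    # Multiply each column by the sign of the matching diagonal entry of R;
    # without this correction the QR output is not exactly Haar distributed.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs


def orf_kernel(x, y, p, rng):
    """One draw k_l(x, y) = phi_l(x)^T phi_l(y) from a fresh Haar matrix O_l."""
    d = x.shape[0]
    # Assumption: w_1, ..., w_p are the first p columns of O_l (requires p <= d).
    w = haar_orthogonal(d, rng)[:, :p]

    def phi(v):
        proj = w.T @ v  # the p projections w_i^T v
        return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(p)

    return phi(x) @ phi(y)


def empirical_mean_and_variance(x, y, p, s, rng):
    """M_emp and V_emp over s independent draws, as defined in the quoted text."""
    k = np.array([orf_kernel(x, y, p, rng) for _ in range(s)])
    m_emp = k.mean()
    v_emp = np.mean((k - m_emp) ** 2)
    return m_emp, v_emp


def average_distance_bandwidth(data):
    """sigma = sqrt((1 / n^2) * sum_{i,j} ||x_i - x_j||^2) for rows x_i of data."""
    sq_norms = np.sum(data ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * data @ data.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    n = data.shape[0]
    return np.sqrt(sq_dists.sum() / n ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, s = 300, 50
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    for p in (10, 50, 100):
        m_emp, v_emp = empirical_mean_and_variance(x, y, p, s, rng)
        print(f"p={p}: M_emp={m_emp:.4f}, V_emp={v_emp:.6f}")
```

Comparing M_emp and V_emp with the closed-form expectation and variance derived in the paper would reproduce the kind of approximation error plotted in Figure 1. Those closed forms are not quoted in this report, so they are left out of the sketch.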