Spectral Self-supervised Feature Selection

Authors: Daniel Segal, Ofir Lindenbaum, Ariel Jaffe

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. "We demonstrate the effectiveness of our method through experiments on real-world datasets from multiple domains, with a particular emphasis on biological datasets."
Researcher Affiliation: Academia. Daniel Segal (Hebrew University), Ofir Lindenbaum (Bar Ilan University), Ariel Jaffe (Hebrew University).
Pseudocode: Yes. Algorithm 1: "Pseudo-code for Eigenvector Selection and Pseudo-labels Generation"; Algorithm 2: "Pseudo-code for Spectral Self-supervised Feature Selection (SSFS)".
Open Source Code: No. The paper does not provide concrete access to source code for its methodology. It mentions implementations of third-party tools (scikit-learn, XGBoost, the scikit-feature library, lscae, KNMFS) but gives no link or statement for the authors' own code.
Open Datasets: Yes. "We applied SSFS to eight real-world datasets from various domains. Table 5 in Appendix F.2 gives the number of features, samples, and classes in each dataset. All datasets are available online." (Footnote 1: https://jundongl.github.io/scikit-feature/datasets.html)
Dataset Splits: No. The paper describes an unsupervised feature selection method evaluated by clustering accuracy: k-means is applied multiple times to the selected features of the entire dataset, so no explicit training, validation, or test splits are defined as in a supervised setting.
Hardware Specification: No. The paper does not explicitly describe the hardware used to run its experiments; it discusses computational complexity but gives no specifics such as GPU/CPU models or memory.
Software Dependencies: No. "For SSFS, we use the following surrogate models: (i) the eigenvector selection model h_i is set to logistic regression with ℓ2 regularization; we use scikit-learn's (Pedregosa et al., 2011) implementation with the default regularization value C = 1.0. (ii) The feature selection model f_i is set to an XGBoost classifier with Gain feature importance; we use the popular implementation by DMLC (Chen and Guestrin, 2016)." The paper names the software but does not provide specific version numbers for scikit-learn or XGBoost.
Experiment Setup: Yes. The number of eigenvectors to select, k, is set to the number of distinct classes in each dataset; they are selected from a total of d = 2k eigenvectors. Each subsample contains 95% of the original dataset, and 500 resamples are performed for every dataset. The affinity matrix uses a Gaussian kernel with an adaptive scale σ_i σ_j, where σ_i is the distance from x_i to its k = 2 nearest neighbor. The Laplacian used is the symmetric normalized Laplacian.
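The affinity and Laplacian construction described above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: distances are taken as Euclidean, the kernel's unit diagonal is kept, and the graph is fully connected.

```python
import numpy as np

def symmetric_normalized_laplacian(X, k=2):
    # Pairwise Euclidean distances between all samples.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Adaptive scale: sigma_i is the distance from x_i to its k-th nearest
    # neighbor (k = 2 per the setup; column 0 of the sorted row is the self-distance).
    sigma = np.sort(D, axis=1)[:, k]
    # Gaussian kernel with adaptive scale sigma_i * sigma_j.
    W = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

X = np.random.default_rng(0).standard_normal((30, 4))
L = symmetric_normalized_laplacian(X)
# The spectral embedding uses the eigenvectors with the smallest eigenvalues;
# the method then selects k informative ones from the leading d = 2k.
eigvals, eigvecs = np.linalg.eigh(L)
```

For a connected graph the smallest eigenvalue of this Laplacian is zero, and the informative structure lives in the next eigenvectors, which the method's resampling and surrogate-model steps then score.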