Spectral Self-supervised Feature Selection
Authors: Daniel Segal, Ofir Lindenbaum, Ariel Jaffe
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our method through experiments on real-world datasets from multiple domains, with a particular emphasis on biological datasets. |
| Researcher Affiliation | Academia | Daniel Segal EMAIL Hebrew University Ofir Lindenbaum EMAIL Bar Ilan University Ariel Jaffe EMAIL Hebrew University |
| Pseudocode | Yes | Algorithm 1: Pseudo-code for Eigenvector Selection and Pseudo-labels Generation; Algorithm 2: Pseudo-code for Spectral Self-supervised Feature Selection (SSFS) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using implementations of third-party tools (scikit-learn, XGBoost, the scikit-feature library, lscae, KNMFS) but provides no link or statement for the authors' own code. |
| Open Datasets | Yes | We applied SSFS to eight real-world datasets from various domains. Table 5 in Appendix F.2 gives the number of features, samples, and classes in each dataset. All datasets are available online 1. [Footnote 1: https://jundongl.github.io/scikit-feature/datasets.html] |
| Dataset Splits | No | The paper describes an unsupervised feature selection method and evaluates it using clustering accuracy. The evaluation involves applying k-means multiple times on selected features from the entire dataset, rather than defining explicit training, validation, or test splits for model training in a supervised context. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. It discusses computational complexity but does not provide specific details such as GPU/CPU models or memory. |
| Software Dependencies | No | For SSFS, we use the following surrogate models: (i) The eigenvector selection model hi is set to Logistic Regression with ℓ2 regularization. We use scikit-learn's (Pedregosa et al., 2011) implementation with a default regularization value of C = 1.0. (ii) The feature selection model fi is set to XGBoost classifier with Gain feature importance. We use the popular implementation by DMLC (Chen and Guestrin, 2016). The paper mentions software by name but does not provide specific version numbers for scikit-learn or XGBoost. |
| Experiment Setup | Yes | The number of eigenvectors to select, k, is set to the number of distinct classes in each dataset; they are selected from a total of d = 2k eigenvectors. Each subsample contains 95% of the original dataset, and 500 resamples are performed for every dataset. For the affinity matrix, a Gaussian kernel with an adaptive scale σiσj is used, where σi is the distance from xi to its k = 2 nearest neighbor. The Laplacian used is the symmetric normalized Laplacian. |
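The affinity-matrix construction quoted in the Experiment Setup row (Gaussian kernel with adaptive scale σiσj, symmetric normalized Laplacian) can be sketched as follows. This is a minimal NumPy illustration of the described recipe, not the authors' code; the function name and structure are ours:

```python
import numpy as np

def symmetric_normalized_laplacian(X, k=2):
    """Build L = I - D^{-1/2} W D^{-1/2} from a Gaussian affinity
    with adaptive scale sigma_i * sigma_j, where sigma_i is the
    distance from x_i to its k-th nearest neighbor (as in the paper)."""
    # pairwise Euclidean distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # sigma_i: distance to the k-th nearest neighbor (column 0 is self)
    sigma = np.sort(dist, axis=1)[:, k]
    # Gaussian affinity with adaptive scale sigma_i * sigma_j
    W = np.exp(-dist**2 / np.outer(sigma, sigma))
    # degree matrix and symmetric normalization
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
```

The leading eigenvectors of this Laplacian (those with the smallest eigenvalues) are the ones the paper's Algorithm 1 scores and selects before generating pseudo-labels.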
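The Dataset Splits row notes that evaluation uses repeated k-means on the selected features rather than train/test splits. A hedged sketch of that protocol, with clustering accuracy computed via an optimal label matching (Hungarian algorithm); all function names and the number of runs here are our own assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one matching between
    cluster ids and ground-truth class labels."""
    n = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximize matched pairs
    return cost[rows, cols].sum() / len(y_true)

def evaluate_selected_features(X, y, selected, n_runs=20, seed=0):
    """Average k-means clustering accuracy over repeated runs,
    using only the selected feature columns of the full dataset."""
    k = len(np.unique(y))
    accs = []
    for r in range(n_runs):
        pred = KMeans(n_clusters=k, n_init=10,
                      random_state=seed + r).fit_predict(X[:, selected])
        accs.append(clustering_accuracy(y, pred))
    return float(np.mean(accs))
```

Because k-means is sensitive to initialization, averaging over multiple runs (as the paper describes) gives a more stable estimate of feature-set quality.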