Spectral Self-supervised Feature Selection

Authors: Daniel Segal, Ofir Lindenbaum, Ariel Jaffe

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type: Experimental. "We demonstrate the effectiveness of our method through experiments on real-world datasets from multiple domains, with a particular emphasis on biological datasets."
Researcher Affiliation: Academia. Daniel Segal (Hebrew University), Ofir Lindenbaum (Bar Ilan University), Ariel Jaffe (Hebrew University).
Pseudocode: Yes. Algorithm 1: "Pseudo-code for Eigenvector Selection and Pseudo-labels Generation"; Algorithm 2: "Pseudo-code for Spectral Self-supervised Feature Selection (SSFS)".
Open Source Code: No. The paper does not provide concrete access to source code for its methodology. It mentions implementations of third-party tools (scikit-learn, XGBoost, the scikit-feature library, lscae, KNMFS) but gives no link or statement for the authors' own code.
Open Datasets: Yes. "We applied SSFS to eight real-world datasets from various domains. Table 5 in Appendix F.2 gives the number of features, samples, and classes in each dataset. All datasets are available online." (Footnote 1: https://jundongl.github.io/scikit-feature/datasets.html)
Dataset Splits: No. The paper describes an unsupervised feature selection method evaluated by clustering accuracy: k-means is applied multiple times to the selected features of the entire dataset, so no explicit training, validation, or test splits are defined as in a supervised setting.
Hardware Specification: No. The paper does not explicitly describe the hardware used to run its experiments; it discusses computational complexity but gives no specifics such as GPU/CPU models or memory.
Software Dependencies: No. "For SSFS, we use the following surrogate models: (i) the eigenvector selection model h_i is set to logistic regression with ℓ2 regularization; we use scikit-learn's (Pedregosa et al., 2011) implementation with the default regularization value C = 1.0. (ii) The feature selection model f_i is set to an XGBoost classifier with Gain feature importance; we use the popular implementation by DMLC (Chen and Guestrin, 2016)." The paper names the software but does not provide specific version numbers for scikit-learn or XGBoost.
Experiment Setup: Yes. The number of eigenvectors to select, k, is set to the number of distinct classes in each dataset; they are selected from a total of d = 2k eigenvectors. Each subsample contains 95% of the original dataset, and 500 resamples are performed for every dataset. The affinity matrix uses a Gaussian kernel with an adaptive scale σ_i σ_j, where σ_i is the distance from x_i to its k = 2 nearest neighbor. The Laplacian used is the symmetric normalized Laplacian.
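The affinity and Laplacian construction described above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: distances are taken as Euclidean, the kernel's unit diagonal is kept, and the graph is fully connected.

```python
import numpy as np

def symmetric_normalized_laplacian(X, k=2):
    # Pairwise Euclidean distances between all samples.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Adaptive scale: sigma_i is the distance from x_i to its k-th nearest
    # neighbor (k = 2 per the setup; column 0 of the sorted row is the self-distance).
    sigma = np.sort(D, axis=1)[:, k]
    # Gaussian kernel with adaptive scale sigma_i * sigma_j.
    W = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]

X = np.random.default_rng(0).standard_normal((30, 4))
L = symmetric_normalized_laplacian(X)
# The spectral embedding uses the eigenvectors with the smallest eigenvalues;
# the method then selects k informative ones from the leading d = 2k.
eigvals, eigvecs = np.linalg.eigh(L)
```

For a connected graph the smallest eigenvalue of this Laplacian is zero, and the informative structure lives in the next eigenvectors, which the method's resampling and surrogate-model steps then score.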