A Robust-Equitable Measure for Feature Ranking and Selection

Authors: A. Adam Ding, Jennifer G. Dy, Yi Li, Yale Chang

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on both synthetic and real-world data sets confirm the theoretical analysis and illustrate the advantage of using the dependence measure RCD for feature selection.
Researcher Affiliation | Academia | A. Adam Ding, Department of Mathematics, Northeastern University, Boston, MA 02115, USA; Jennifer G. Dy, Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA; Yi Li, Department of Mathematics, Northeastern University, Boston, MA 02115, USA; Yale Chang, Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA
Pseudocode | No | The paper describes methods such as the k-NN-based estimator and mRMR, but does not present them in a structured pseudocode or algorithm block. No section or figure is explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a direct link to a code repository. While it mentions a CC-BY 4.0 license and a JMLR paper page, this does not constitute a concrete access statement for source code.
Open Datasets | Yes | Consider the stock data set from StatLib (http://lib.stat.cmu.edu/). This data set provides daily stock prices for ten aerospace companies. Our task is to determine the relative relevance of the stock prices of the first two companies (X1, X2) in predicting that of the fifth company (Y). The scatter plots of Y against X1 and X2 are presented in Figure 7. Ideally, self-equitable measures should prefer X1 over X2 because the MSE associated with X1 is lower even though it has a more complex functional form. As we can see from Table 8, the self-equitable measures MI, CD2, and RCD all correctly select X1, while measures that are not self-equitable fail to select the right feature. and Consider the KEGG metabolic reaction network data set (Lichman, 2013). Our task is to select the most relevant features in predicting the target variable Characteristic path length (Y). The Average shortest path (X1), Eccentricity (X2), and Closeness centrality (X3) are used as candidate features. and M. Lichman. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2013.
Dataset Splits | Yes | We measure performance by the 10-fold cross-validated MSE of spline regression, a general nonlinear predictor (Friedman, 1991), using the selected features.
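The 10-fold cross-validated MSE criterion quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' code: a simple least-squares line stands in for the spline regression of Friedman (1991), and the function names (`kfold_splits`, `fit_linear`, `cv_mse`) are hypothetical.

```python
import random

def kfold_splits(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def fit_linear(x, y):
    """Ordinary least squares on one feature (stand-in for spline regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def cv_mse(x, y, k=10):
    """k-fold cross-validated mean squared error of the fitted predictor."""
    errs = []
    for tr, te in kfold_splits(len(x), k):
        a, b = fit_linear([x[i] for i in tr], [y[i] for i in tr])
        errs += [(y[i] - (a + b * x[i])) ** 2 for i in te]
    return sum(errs) / len(errs)
```

In the paper's protocol, a candidate feature subset would be scored by this held-out MSE, with the spline regressor replacing the linear stand-in.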
Hardware Specification | No | The paper describes various experiments and their results, but it does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run these experiments.
Software Dependencies | No | The paper mentions statistical methods and algorithms used, such as 'spline regression', 'kernel based measures', and specific parameter settings for HSNIC and k-NN estimators. However, it does not specify any software libraries, programming languages, or their respective version numbers that were used for implementation.
Experiment Setup | Yes | In this section, we empirically verify the properties of RCD in our theoretical analysis. We first check the estimation errors for RCD in synthetic experiments with additive noise and mixture noise, respectively. For each type of noise, we simulate data with several different relationships so as to show the effect of self-equitability and robust-equitability, respectively. In particular, we compare the RCD estimator with an MI estimator based on the same density estimation. Due to the non-robust-equitability of MI, in the mixture noise cases, the MI estimator varies widely with the sample sizes. In contrast, RCD converges as the sample size increases. Therefore, MI may provide a misleading ranking of features with unequal sample sizes. Also, the ranking between relationships with the two different noise types is greatly affected by the sample size under MI, while the ranking under RCD remains relatively stable. We then conduct several synthetic experiments to illustrate the properties in feature selection, and then show that similar patterns exist on real-world data sets. and We generate data from the following additive regression model Y = 1.5 cos(3πX1) + (1 - 2|2X2 - 1|)^2 + ϵ, where X1 and X2 are uniformly distributed on [0, 1], and ϵ ~ N(0, 0.05). and The sample size n = 1000 is used in this experiment. and For kernel based measures, we follow the settings used by Fukumizu et al. (2007). For HSNIC, we set the regularization parameter ϵ_n = 10^(-5) n^(-1/3.1) to satisfy the convergence guarantee given by Theorem 5 of Fukumizu et al. (2007). As discussed in the previous section, we set k = 0.25√n for the k-NN estimator of MI, RCD and CD2.
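The synthetic setup quoted above can be sketched as follows. This is a hedged illustration, not the authors' code: `simulate_additive` and `knn_k` are hypothetical names, N(0, 0.05) is treated here as a standard deviation (the paper's notation may denote the variance), and the k-NN neighborhood size is read as k = 0.25√n.

```python
import math
import random

def simulate_additive(n=1000, sigma=0.05, seed=0):
    """Draw n samples from the additive model quoted above:
    Y = 1.5*cos(3*pi*X1) + (1 - 2*|2*X2 - 1|)**2 + eps,
    with X1, X2 ~ Uniform[0, 1] and eps ~ N(0, sigma).
    NOTE: sigma is used as the std here; the paper's N(0, 0.05)
    might instead specify the variance."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        y = (1.5 * math.cos(3 * math.pi * x1)
             + (1 - 2 * abs(2 * x2 - 1)) ** 2
             + rng.gauss(0, sigma))
        data.append((x1, x2, y))
    return data

def knn_k(n):
    """Neighborhood size for the k-NN estimators of MI, RCD and CD2,
    assuming the setting k = 0.25 * sqrt(n)."""
    return max(1, round(0.25 * math.sqrt(n)))
```

With n = 1000 as in the quoted experiment, this rule gives k = 8 neighbors; X1 contributes the oscillatory cosine term and X2 the tent-shaped bump, so both features are relevant but with different functional complexity.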