Model-free Variable Selection in Reproducing Kernel Hilbert Space

Authors: Lei Yang, Shaogao Lv, Junhui Wang

JMLR 2016

Reproducibility Variable Result LLM Response
Research Type Experimental The effectiveness of the proposed method is also supported by a variety of simulated examples and two real-life examples. Keywords: group Lasso, high-dimensional data, kernel regression, learning gradients, reproducing kernel Hilbert space (RKHS), variable selection. This section examines the effectiveness of the proposed model-free variable selection method and compares it against some popular model-based methods in the literature, including variable selection with the additive model (Xue, 2009), Cosso (Lin and Zhang, 2006), sparse gradient learning (Ye and Xie, 2012) and multivariate adaptive regression splines (Friedman, 1991), denoted as MF, Add, Cosso, SGL and Mars, respectively. In all the experiments, the Gaussian kernel $K(x, u) = e^{-\|x - u\|_2^2 / 2\sigma_n^2}$ is used, where the scale parameters $\sigma_n^2$ and $\tau_n^2$ in $w(x, u)$ are set as the median of the pairwise distances among all the sample points (Mukherjee and Zhou, 2006). Other tuning parameters in the competitors, such as the number of knots in Xue (2009), are set to the default values in the available R and Matlab packages. The tuning parameters in each method are determined by the stability-based selection criterion of Sun et al. (2013). The idea is to conduct a cross-validation-like scheme and measure stability as the agreement between two estimated active sets: the training set is randomly split into two subsets, the candidate variable selection method is applied to each subset, and the two estimated active sets are denoted $\hat{A}_{1b}$ and $\hat{A}_{2b}$. The variable selection stability is then approximated by $s_\lambda = \frac{1}{B} \sum_{b=1}^{B} \kappa(\hat{A}_{1b}, \hat{A}_{2b})$, where $B$ is the number of random splits in the cross-validation scheme and $\kappa(\cdot, \cdot)$ is the standard Cohen's kappa statistic measuring the agreement between two sets. The tuning parameter $\lambda$ is selected as the value maximizing $s_\lambda$. Finally, the performance of all methods is evaluated by a number of measures of variable selection accuracy.
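The stability-based tuning criterion described above can be sketched in a few lines. The following Python sketch is illustrative only: the split-half loop, the Cohen's kappa computation on membership-indicator vectors, and the `select` callback are our own framing under the description of Sun et al. (2013), not the authors' code.

```python
import numpy as np

def cohen_kappa(A1, A2, p):
    """Cohen's kappa between two active sets over p candidate variables."""
    a = np.zeros(p); a[list(A1)] = 1
    b = np.zeros(p); b[list(A2)] = 1
    po = np.mean(a == b)  # observed agreement on variable membership
    pe = np.mean(a) * np.mean(b) + (1 - np.mean(a)) * (1 - np.mean(b))
    return (po - pe) / (1 - pe) if pe < 1 else 0.0

def stability_select(X, y, select, lambdas, B=20, rng=None):
    """Pick the lambda maximizing s_lambda = (1/B) * sum_b kappa(A_1b, A_2b).

    `select(X, y, lam)` is any variable selection routine returning the
    set of selected variable indices (a hypothetical callback).
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    scores = []
    for lam in lambdas:
        kappas = []
        for _ in range(B):
            perm = rng.permutation(n)
            half1, half2 = perm[: n // 2], perm[n // 2:]
            A1 = select(X[half1], y[half1], lam)
            A2 = select(X[half2], y[half2], lam)
            kappas.append(cohen_kappa(A1, A2, p))
        scores.append(np.mean(kappas))
    return lambdas[int(np.argmax(scores))]
```

A perfectly stable selector (identical active sets across splits) attains kappa 1, while chance-level agreement gives kappa near 0, so maximizing $s_\lambda$ favors tuning values whose selections are reproducible.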
Two simulated examples are considered. The first example was used in Xue (2009) and Huang et al. (2010), where the true regression model is additive. The second example modifies the generating scheme of the first and includes interaction terms. Example 1: First generate $p$-dimensional variables $x_i = (x_{i1}, \ldots, x_{ip})^T$ with $x_{ij} = \frac{W_{ij} + \eta U_i}{1 + \eta}$, where $W_{ij}$ and $U_i$ are drawn independently from $U(-0.5, 0.5)$, for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. When $\eta = 0$ all variables are independent, whereas when $\eta = 1$ correlation is present among the variables. Next, set $f(x_i) = 5f_1(x_{i1}) + 3f_2(x_{i2}) + 4f_3(x_{i3}) + 6f_4(x_{i4})$, with $f_1(u) = u$, $f_2(u) = (2u - 1)^2$, $f_3(u) = \frac{\sin(\pi u)}{2 - \sin(\pi u)}$, and $f_4(u) = 0.1\sin(\pi u) + 0.2\cos(\pi u) + 0.3\sin^2(\pi u) + 0.4\cos^3(\pi u) + 0.5\sin^3(\pi u)$. Finally, generate $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim N(0, 1.31^2)$. Clearly, the true underlying regression model is additive. Example 2: The generating scheme is similar to Example 1, except that $f(x_i) = (2x_{i1} - 1)(2x_{i2} - 1)$, $W_{ij}$ and $U_i$ are drawn independently from $N(0, 1)$, and $\epsilon_i \sim N(0, 1)$. The underlying regression model thus includes interaction terms, so the additive model assumption is no longer valid. For each example, different scenarios are considered with $\eta = 0$ or $1$, and $(n, p) = (100, 10)$, $(100, 20)$ or $(200, 50)$. Each scenario is replicated 50 times, and the averaged performance measures are summarized in Tables 1 and 2. Specifically, Size is the average number of selected variables, TP the number of truly informative variables selected, FP the number of truly non-informative variables selected, and C, U, O the counts of correct-fitting, under-fitting and over-fitting, respectively. It is evident that the proposed MF method delivers superior selection performance against its competitors. In Table 1, where the true model is indeed additive, MF performs similarly to Add and SGL, whereas Cosso and Mars appear more likely to overfit.
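For concreteness, the generating scheme of Example 1 can be reproduced as follows. This Python sketch follows the formulas above; the function name, seeding, and vectorized layout are our own choices rather than the authors' code.

```python
import numpy as np

def generate_example1(n, p, eta, rng=None):
    """Simulate Example 1: x_ij = (W_ij + eta*U_i)/(1 + eta),
    with an additive signal in the first four variables."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-0.5, 0.5, size=(n, p))   # W_ij ~ U(-0.5, 0.5)
    U = rng.uniform(-0.5, 0.5, size=(n, 1))   # common factor inducing correlation
    X = (W + eta * U) / (1 + eta)

    f1 = lambda u: u
    f2 = lambda u: (2 * u - 1) ** 2
    f3 = lambda u: np.sin(np.pi * u) / (2 - np.sin(np.pi * u))
    f4 = lambda u: (0.1 * np.sin(np.pi * u) + 0.2 * np.cos(np.pi * u)
                    + 0.3 * np.sin(np.pi * u) ** 2
                    + 0.4 * np.cos(np.pi * u) ** 3
                    + 0.5 * np.sin(np.pi * u) ** 3)

    f = 5 * f1(X[:, 0]) + 3 * f2(X[:, 1]) + 4 * f3(X[:, 2]) + 6 * f4(X[:, 3])
    y = f + rng.normal(0.0, 1.31, size=n)     # eps_i ~ N(0, 1.31^2)
    return X, y
```

Setting `eta=0` yields independent covariates; `eta=1` shares the common factor $U_i$ across all coordinates, inducing the correlated scenario.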
In Table 2, where the true model contains interaction terms, the performance of MF remains competitive, whereas Add tends to under-fit more frequently, and Cosso, Mars and SGL tend to overfit as the dimension increases. Furthermore, in both examples with $\eta = 1$, the correlation among variables clearly increases the difficulty of selecting the truly informative variables, yet the proposed MF method still outperforms its competitors. It is also noted that MF outperforms SGL in estimation accuracy, but this comparison is omitted here since only MF and SGL estimate the gradient function $g$, whereas Add, Cosso and Mars estimate the regression function $f$. The proposed model-free variable selection method is applied to three real data examples, the Boston housing data, the Ozone concentration data, and the digit recognition data, all of which are publicly available. The Boston housing data concerns the median value of owner-occupied homes in each of the 506 census tracts in the Boston Standard Metropolitan Statistical Area in 1970. It consists of 13 variables that may affect the housing price: per capita crime rate by town (CRIM), proportion of residential land zoned for lots over 25,000 square feet (ZN), proportion of non-retail business acres per town (INDUS), Charles River dummy variable (CHAS), nitric oxides concentration (NOX), average number of rooms per dwelling (RM), proportion of owner-occupied units built prior to 1940 (AGE), weighted distances to five Boston employment centers (DIS), index of accessibility to radial highways (RAD), full-value property-tax rate per $10,000 (TAX), pupil-teacher ratio by town (PTRATIO), proportion of blacks by town (B), and lower status of the population (LSTAT). The Ozone concentration data concerns daily measurements of Ozone concentration in the Los Angeles basin over 330 days.
The Ozone concentration may be influenced by 11 meteorological quantities: month (M), day of month (DM), day of week (DW), Vandenburg 500 millibar height (VDHT), wind speed (WDSP), humidity (HMDT), temperature at Sandburg (SBTH), inversion base height (IBHT), Daggett pressure gradient (DGPG), inversion base temperature (IBTP) and visibility (VSTY). These two datasets have been widely analyzed in the literature, including Breiman and Friedman (1985), Xue (2009), and Lin and Zhang (2006). For the digit recognition data, each digit is described by an 8×8 gray-scale image with each entry ranging from 0 to 16. We focus on digits 3 and 5 due to their similarity, and the resultant dataset consists of 365 observations and 64 attributes. In our analysis, all the variables and responses are standardized, and the selected variables are summarized. The informative variables selected by MF, Add, Cosso and Mars are summarized in Tables 3 and 4. As the truly informative variables are unknown in real examples, averaged prediction errors with the selected variables are also reported to compare performance. To compute the averaged prediction error, each dataset is randomly split into two parts: m observations for testing and the remaining for training. Specifically, m = 30 for the Boston housing data, m = 50 for the Ozone concentration data, and m = 35 for the digit recognition data. Each example is replicated 100 times, and the averaged prediction errors of MF, Add, Cosso and Mars are summarized in Tables 3-5. For the Boston housing data, MF and SGL select two informative variables, RM and LSTAT, whereas Add, Cosso and Mars tend to select more variables. However, the corresponding prediction errors of Add, Cosso and Mars appear to be larger than those of MF and SGL, implying that the additional variables selected by Add, Cosso and Mars may hinder the prediction performance. For the Ozone concentration data, both MF and Cosso select four variables, but Add and Mars select more.
One discrepancy is the variable IBTP, which is selected by MF, Cosso, SGL and Mars but not by Add. As claimed in Gregory et al. (2012), M, SBTH and IBTP are the three most important meteorological variables related to Ozone concentration, as all of them describe temperature changes. Meanwhile, MF and Cosso show smaller prediction errors than SGL and Mars, which suggests that SGL and Mars may include some non-informative variables. Figure 1 displays scatter plots of the responses against the variables selected by MF in the Boston housing data and the Ozone concentration data. It is clear that all the selected variables show a moderate to strong relationship with the responses. For the digit recognition data, MF selects far fewer variables than the other competitors and achieves a smaller prediction error. Figure 2 shows some randomly selected digits of 3 and 5 and the two selected informative variables, where the left informative variable is always contained in digit 5 and the right one is always contained in digit 3.
Researcher Affiliation Academia Lei Yang (Department of Population Health, New York University, New York, NY 10016, USA); Shaogao Lv (Center of Statistics, Southwestern University of Finance and Economics, Chengdu, Sichuan 610074, China); Junhui Wang (Department of Mathematics, City University of Hong Kong, Kowloon Tong, 999077, Hong Kong)
Pseudocode Yes To solve (4), we develop a block coordinate descent algorithm as in Yang and Zou (2015). First, after dropping the $\alpha$-unrelated terms, the cost function in (4) can be simplified as
$$\mathop{\arg\min}_{\alpha} \; -\alpha^T U + \tfrac{1}{2}\alpha^T M \alpha + \lambda_n \sum_{l=1}^{p} \pi_l \|\alpha^{(l)}\|_2, \qquad (5)$$
where $\alpha^T = ((\alpha^{(1)})^T, \ldots, (\alpha^{(p)})^T)$, $U = \frac{2}{n(n-1)} \sum_{i,j=1}^{n} w_{ij} U_{ij}$, $M = \frac{2}{n(n-1)} \sum_{i,j=1}^{n} w_{ij} M_{ij}$, $U_{ij} = (y_i - y_j)(x_i - x_j) \otimes K_i$, $M_{ij} = (x_i - x_j)(x_i - x_j)^T \otimes K_i K_i^T$, $K_i$ is the $i$-th column of $K = (K(x_i, x_j))_{n \times n}$, $I_n$ is the $n$-dimensional identity matrix, and $\otimes$ denotes the Kronecker product. Then we update one $\alpha^{(l)}$ at a time with the others held fixed, and the $l$-th subproblem becomes
$$\mathop{\arg\min}_{\alpha^{(l)}} \; L(\alpha) + \lambda_n \pi_l \|\alpha^{(l)}\|_2 = -\alpha^T U + \tfrac{1}{2}\alpha^T M \alpha + \lambda_n \pi_l \|\alpha^{(l)}\|_2.$$
To solve the subproblem, a similar approximation as in Yang and Zou (2015) can be employed, where the updated $\alpha^{(l)}$ is obtained by solving
$$\mathop{\arg\min}_{\alpha^{(l)}} \; \nabla_l L(\tilde\alpha)^T (\alpha^{(l)} - \tilde\alpha^{(l)}) + \tfrac{\gamma^{(l)}}{2} (\alpha^{(l)} - \tilde\alpha^{(l)})^T (\alpha^{(l)} - \tilde\alpha^{(l)}) + \lambda_n \pi_l \|\alpha^{(l)}\|_2. \qquad (6)$$
Here $\tilde\alpha$ is the current estimate of $\alpha$, $\tilde\alpha^{(l)}$ is its $l$-th block, $\nabla L(\tilde\alpha) = \frac{2}{n(n-1)} \sum_{i,j=1}^{n} w_{ij}(M_{ij}\tilde\alpha - U_{ij})$, $\nabla_l L(\tilde\alpha)$ denotes the $l$-th block vector of $\nabla L(\tilde\alpha)$, and
$$\nabla_l L(\tilde\alpha) = \frac{2}{n(n-1)} \sum_{i,j=1}^{n} w_{ij} \Big( \sum_{s=1}^{p} ((x_i - x_j)(x_i - x_j)^T)_{ls} \, K_i K_i^T \tilde\alpha^{(s)} - (y_i - y_j)(x_{il} - x_{jl}) K_i \Big),$$
where $((x_i - x_j)(x_i - x_j)^T)_{ls}$ is the $(l, s)$-th entry of $(x_i - x_j)(x_i - x_j)^T$. Furthermore, denote by $\gamma^{(l)}$ the largest eigenvalue of
$$M^{(l)} = \frac{2}{n(n-1)} \sum_{i,j=1}^{n} w_{ij} ((x_i - x_j)(x_i - x_j)^T)_{ll} \, K_i K_i^T,$$
which is the $l$-th $n \times n$ diagonal block of $M$. It is straightforward to show that (6) has the analytical solution
$$\alpha^{(l)} = \Big( \tilde\alpha^{(l)} - \frac{\nabla_l L(\tilde\alpha)}{\gamma^{(l)}} \Big) \Big( 1 - \frac{\lambda_n \pi_l}{\|\gamma^{(l)} \tilde\alpha^{(l)} - \nabla_l L(\tilde\alpha)\|_2} \Big)_+.$$
The proposed algorithm then iteratively updates $\alpha^{(l)}$ for $l = 1, \ldots, p, 1, \ldots$ until convergence. The algorithm is guaranteed to converge to the global minimum, since the cost function in (5) is convex and its value decreases in each updating step. Furthermore, the computational complexity of the block coordinate descent algorithm is $O(n^2 p^2 D)$, with $D$ the number of iterations until convergence, which can be substantially less than the complexity of solving (5) with standard optimization packages.
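Since $L(\alpha) = -\alpha^T U + \frac{1}{2}\alpha^T M \alpha$ has gradient $\nabla L(\alpha) = M\alpha - U$, the per-block update is a group soft-threshold. The following Python sketch of the block coordinate descent is illustrative only: it works with the assembled dense $U$ and $M$, and the function name, convergence check, and default tolerances are our own assumptions, not the authors' implementation.

```python
import numpy as np

def block_coordinate_descent(U, M, n, p, lam, pi=None, max_iter=200, tol=1e-8):
    """Minimize -a^T U + 0.5 a^T M a + lam * sum_l pi_l * ||a^(l)||_2,
    where a is an (n*p)-vector with p blocks of length n.
    Returns the estimate reshaped to (p, n), one row per block."""
    if pi is None:
        pi = np.ones(p)
    alpha = np.zeros(n * p)
    # gamma_l: largest eigenvalue of the l-th n x n diagonal block of M
    gammas = [np.linalg.eigvalsh(M[l*n:(l+1)*n, l*n:(l+1)*n])[-1]
              for l in range(p)]
    for _ in range(max_iter):
        max_change = 0.0
        for l in range(p):
            sl = slice(l * n, (l + 1) * n)
            grad_l = (M @ alpha - U)[sl]           # l-th block of grad L
            v = gammas[l] * alpha[sl] - grad_l     # unthresholded direction
            norm_v = np.linalg.norm(v)
            shrink = max(0.0, 1.0 - lam * pi[l] / norm_v) if norm_v > 0 else 0.0
            new_block = (shrink / gammas[l]) * v   # group soft-threshold
            max_change = max(max_change, np.max(np.abs(new_block - alpha[sl])))
            alpha[sl] = new_block
        if max_change < tol:
            break
    return alpha.reshape(p, n)
```

With `lam=0` the update reduces to a plain per-block gradient step converging to the unpenalized minimizer of (5); a sufficiently large `lam` zeroes out every block, which is the mechanism that discards non-informative variables.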
Open Source Code No The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets Yes The proposed model-free variable selection method is applied to three real data examples, the Boston housing data, the Ozone concentration data, and the digit recognition data, all of which are publicly available.
Dataset Splits Yes To compute the averaged prediction error, each dataset is randomly split into two parts: m observations for testing and the remaining for training. Specifically, m = 30 for the Boston housing data, m = 50 for the Ozone concentration data, and m = 35 for digit recognition data.
Hardware Specification No The paper does not provide specific details about the hardware used for running the experiments. It only mentions general terms like 'available R and Matlab packages' for competitors.
Software Dependencies No The paper mentions 'R and Matlab packages' and the Gaussian kernel $K(x, u) = e^{-\|x - u\|_2^2 / 2\sigma_n^2}$, but no specific version numbers for any software components are provided.
Experiment Setup Yes In all the experiments, the Gaussian kernel $K(x, u) = e^{-\|x - u\|_2^2 / 2\sigma_n^2}$ is used, where the scale parameters $\sigma_n^2$ and $\tau_n^2$ in $w(x, u)$ are set as the median of the pairwise distances among all the sample points (Mukherjee and Zhou, 2006). Other tuning parameters in the competitors, such as the number of knots in Xue (2009), are set to the default values in the available R and Matlab packages. The tuning parameters in each method are determined by the stability-based selection criterion of Sun et al. (2013). The idea is to conduct a cross-validation-like scheme and measure stability as the agreement between two estimated active sets: the training set is randomly split into two subsets, the candidate variable selection method is applied to each subset, and the two estimated active sets are denoted $\hat{A}_{1b}$ and $\hat{A}_{2b}$. The variable selection stability is then approximated by $s_\lambda = \frac{1}{B} \sum_{b=1}^{B} \kappa(\hat{A}_{1b}, \hat{A}_{2b})$, where $B$ is the number of random splits in the cross-validation scheme and $\kappa(\cdot, \cdot)$ is the standard Cohen's kappa statistic measuring the agreement between two sets. The tuning parameter $\lambda$ is selected as the value maximizing $s_\lambda$. Finally, the performance of all methods is evaluated by a number of measures of variable selection accuracy. In our analysis, all the variables and responses are standardized, and the selected variables are summarized. The informative variables selected by MF, Add, Cosso and Mars are summarized in Tables 3 and 4. As the truly informative variables are unknown in real examples, averaged prediction errors with the selected variables are also reported to compare performance. To compute the averaged prediction error, each dataset is randomly split into two parts: m observations for testing and the remaining for training. Specifically, m = 30 for the Boston housing data, m = 50 for the Ozone concentration data, and m = 35 for the digit recognition data.
Each example is replicated 100 times, and the averaged prediction errors by MF, Add, Cosso and Mars are summarized in Tables 3-5.