Rank-based Lasso - efficient methods for high-dimensional robust model selection

Authors: Wojciech Rejchel, Małgorzata Bogdan

JMLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Theoretical results are supported by the simulation study and the real data analysis, which show that our methods can properly identify relevant predictors, even when the error terms come from the Cauchy distribution and the link function is nonlinear. They also demonstrate the superiority of the modified versions of Rank Lasso over its regular version in the case when predictors are substantially correlated. The numerical study also shows that Rank Lasso performs substantially better in model selection than LAD-Lasso, which is a well-established methodology for robust model selection.
Researcher Affiliation | Academia | Wojciech Rejchel (EMAIL), Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100 Toruń, Poland; Małgorzata Bogdan (EMAIL), Faculty of Mathematics and Computer Science, University of Wrocław, Joliot-Curie 15, 50-383 Wrocław, Poland, and Department of Statistics, Lund University, Tycho Brahes väg 1, Lund, Sweden
Pseudocode | No | Our procedure is very simple and relies on replacing the actual values of the response variables $Y_i$ by their centred ranks. Ranks $R_i$ are defined as $R_i = \sum_{j=1}^{n} I(Y_j \le Y_i)$, $i = 1, \ldots, n$ (4), where $I(\cdot)$ is the indicator function. Next, we identify significant predictors by simply solving the following Lasso problem; Rank Lasso: $\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^p} Q(\theta) + \lambda |\theta|_1$ (5), where $Q(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( R_i/n - 0.5 - \theta^T X_i \right)^2$ (6). This procedure does not require any dedicated algorithm and can be executed using efficient implementations of Lasso in R (R Development Core Team, 2017) packages: lars (Efron et al., 2004) or glmnet (Friedman et al., 2010).
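The quoted procedure is simple enough to sketch directly. The paper executes it with the R packages lars or glmnet; the snippet below is an illustrative Python reconstruction (not the authors' code) using scikit-learn, where the mapping of the paper's $\lambda$ onto sklearn's `alpha` (which uses a $\frac{1}{2n}\|y - X\theta\|^2$ convention) is our assumption:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import Lasso

def rank_lasso(X, Y, lam):
    """Rank Lasso sketch: replace responses by centred ranks R_i/n - 0.5,
    then solve an ordinary Lasso problem (equations (4)-(6) in the quote)."""
    n = X.shape[0]
    # R_i = sum_j I(Y_j <= Y_i); scipy's 'max' tie method matches this definition
    R = rankdata(Y, method="max")
    y_tilde = R / n - 0.5  # centred ranks
    # sklearn's Lasso minimizes (1/(2n))||y - X theta||^2 + alpha*|theta|_1,
    # while the paper uses (1/n) in Q(theta); alpha = lam/2 bridges the
    # factor-of-two convention (this mapping is an assumption here).
    model = Lasso(alpha=lam / 2, fit_intercept=False)
    model.fit(X, y_tilde)
    return model.coef_

# toy usage: heavy-tailed (Cauchy) noise, 3 relevant predictors, as in Scenario 1
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0
Y = X @ beta + rng.standard_cauchy(n)
theta_hat = rank_lasso(X, Y, lam=0.02)
print(np.nonzero(theta_hat)[0])
```

Because only the ranks of $Y$ enter the objective, the fit is unchanged by any monotone transformation of the responses, which is what makes the method robust to Cauchy errors and nonlinear link functions.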
Open Source Code | No | This procedure does not require any dedicated algorithm and can be executed using efficient implementations of Lasso in R (R Development Core Team, 2017) packages: lars (Efron et al., 2004) or glmnet (Friedman et al., 2010).
Open Datasets | Yes | The considered gene-expression data set was interrogated in lymphoblastoid cell lines of 210 unrelated HapMap individuals (The International HapMap Consortium, 2005) from four populations (60 Utah residents with ancestry from northern and western Europe, 45 Han Chinese in Beijing, 45 Japanese in Tokyo, 60 Yoruba in Ibadan, Nigeria) (Stranger et al., 2007). The data set can be found at ftp://ftp.sanger.ac.uk/pub/genevar/ and was previously studied e.g. in Bradic et al. (2011) and Fan et al. (2014).
Dataset Splits | Yes | Next, the data set is divided into two parts: the training set with 180 randomly selected individuals and the test set with the remaining 30 individuals.
Hardware Specification | No | We gratefully acknowledge the grant of the Wroclaw Center of Networking and Supercomputing (WCSS), where most of the computations were performed.
Software Dependencies | Yes | This procedure does not require any dedicated algorithm and can be executed using efficient implementations of Lasso in R (R Development Core Team, 2017) packages: lars (Efron et al., 2004) or glmnet (Friedman et al., 2010).
Experiment Setup | Yes | More specifically, we consider the following pairs $(n, p)$: $(100, 100)$, $(200, 400)$, $(300, 900)$, $(400, 1600)$. For each of these combinations we consider three different values of the sparsity parameter $p_0 = \#\{j : \beta_j \neq 0\} \in \{3, 10, 20\}$. In three of our simulation scenarios the rows of the design matrix are generated as independent random vectors from the multivariate normal distribution with the covariance matrix $\Sigma$ defined as follows: for the independent case $\Sigma = I$; for the correlated case $\Sigma_{ii} = 1$ and $\Sigma_{ij} = 0.3$ for $i \neq j$. In one of the scenarios the design matrix is created by simulating the genotypes of $p$ independent Single Nucleotide Polymorphisms (SNPs). In this case the explanatory variables can take only three values: 0 for the homozygote for the minor allele (genotype {a,a}), 1 for the heterozygote (genotype {a,A}) and 2 for the homozygote for the major allele (genotype {A,A}). The frequencies of the minor allele for each SNP are independently drawn from the uniform distribution on the interval $(0.1, 0.5)$. Then, given the frequency $\pi_j$ for the $j$-th SNP, the explanatory variable $X_{ij}$ has the distribution $P(X_{ij} = 0) = \pi_j^2$, $P(X_{ij} = 1) = 2\pi_j(1 - \pi_j)$ and $P(X_{ij} = 2) = (1 - \pi_j)^2$. The full description of the simulation scenarios is provided below. Scenario 1: $Y = X\beta + \varepsilon$, where the matrix $X$ is generated according to the independent case, $\beta_1 = \ldots = \beta_{p_0} = 3$ and the elements of $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ are independently drawn from the standard Cauchy distribution. We compare the quality of the above methods by performing 200 replicates of the experiment, where in each replicate we generate a new realization of the design matrix $X$ and the vector of random noise $\varepsilon$. We start with preparing the data set using three pre-processing steps as in Wang et al. (2012): we remove each probe for which the maximum expression among 210 individuals is smaller than the 25th percentile of the entire expression values, we remove any probe for which the range of the expression among 210 individuals is smaller than 2, and finally we select 300 genes whose expressions are the most correlated with the expression level of the analyzed gene. Next, the data set is divided into two parts: the training set with 180 randomly selected individuals and the test set with the remaining 30 individuals.
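The SNP design-matrix scenario quoted above can be reproduced in a few lines. The sketch below is an illustrative reconstruction (not the authors' code): it notes that the three genotype probabilities $\pi_j^2$, $2\pi_j(1-\pi_j)$, $(1-\pi_j)^2$ are exactly those of a Binomial$(2, 1-\pi_j)$ count of major alleles.

```python
import numpy as np

def simulate_snp_design(n, p, rng):
    """Simulate an n-by-p SNP design matrix as described in the quote:
    minor-allele frequencies pi_j ~ Uniform(0.1, 0.5); genotypes coded
    0 ({a,a}), 1 ({a,A}), 2 ({A,A}) with probabilities
    pi^2, 2*pi*(1-pi), (1-pi)^2 respectively."""
    pi = rng.uniform(0.1, 0.5, size=p)  # minor allele frequency per SNP
    # Counting major alleles across two independent draws gives exactly
    # the three genotype probabilities above: Binomial(2, 1 - pi_j).
    X = rng.binomial(2, 1.0 - pi, size=(n, p))
    return X, pi

# usage: the largest (n, p) pair considered in the quoted setup
rng = np.random.default_rng(1)
X, pi = simulate_snp_design(400, 1600, rng)
print(X.shape, sorted(np.unique(X)))
```

Per-SNP frequencies broadcast across rows, so each column has its own genotype distribution while rows remain independent individuals, matching the quoted setup.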