Rank-based Lasso - efficient methods for high-dimensional robust model selection

Authors: Wojciech Rejchel, Małgorzata Bogdan

JMLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Theoretical results are supported by the simulation study and the real data analysis, which show that our methods can properly identify relevant predictors, even when the error terms come from the Cauchy distribution and the link function is nonlinear. They also demonstrate the superiority of the modified versions of Rank Lasso over its regular version in the case when predictors are substantially correlated. The numerical study also shows that Rank Lasso performs substantially better in model selection than LAD-Lasso, which is a well-established methodology for robust model selection.
Researcher Affiliation | Academia | Wojciech Rejchel (EMAIL), Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100 Toruń, Poland; Małgorzata Bogdan (EMAIL), Faculty of Mathematics and Computer Science, University of Wrocław, Joliot-Curie 15, 50-383 Wrocław, Poland, and Department of Statistics, Lund University, Tycho Brahes väg 1, Lund, Sweden
Pseudocode | No | Our procedure is very simple and relies on replacing the actual values of the response variables $Y_i$ by their centred ranks. Ranks $R_i$ are defined as $R_i = \sum_{j=1}^{n} I(Y_j \le Y_i)$, $i = 1, \ldots, n$ (4), where $I(\cdot)$ is the indicator function. Next, we identify significant predictors by simply solving the following Lasso problem; Rank Lasso: $\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^p} Q(\theta) + \lambda |\theta|_1$ (5), where $Q(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( R_i/n - 0.5 - \theta^T X_i \right)^2$ (6). This procedure does not require any dedicated algorithm and can be executed using efficient implementations of Lasso in R (R Development Core Team, 2017) packages: lars (Efron et al., 2004) or glmnet (Friedman et al., 2010).
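The quoted procedure is simple enough to sketch directly. The paper executes it with the R packages lars or glmnet; the snippet below is an illustrative Python reconstruction (not the authors' code) using scikit-learn, where the mapping of the paper's $\lambda$ onto sklearn's `alpha` (which uses a $\frac{1}{2n}\|y - X\theta\|^2$ convention) is our assumption:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import Lasso

def rank_lasso(X, Y, lam):
    """Rank Lasso sketch: replace responses by centred ranks R_i/n - 0.5,
    then solve an ordinary Lasso problem (equations (4)-(6) in the quote)."""
    n = X.shape[0]
    # R_i = sum_j I(Y_j <= Y_i); scipy's 'max' tie method matches this definition
    R = rankdata(Y, method="max")
    y_tilde = R / n - 0.5  # centred ranks
    # sklearn's Lasso minimizes (1/(2n))||y - X theta||^2 + alpha*|theta|_1,
    # while the paper uses (1/n) in Q(theta); alpha = lam/2 bridges the
    # factor-of-two convention (this mapping is an assumption here).
    model = Lasso(alpha=lam / 2, fit_intercept=False)
    model.fit(X, y_tilde)
    return model.coef_

# toy usage: heavy-tailed (Cauchy) noise, 3 relevant predictors, as in Scenario 1
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0
Y = X @ beta + rng.standard_cauchy(n)
theta_hat = rank_lasso(X, Y, lam=0.02)
print(np.nonzero(theta_hat)[0])
```

Because only the ranks of $Y$ enter the objective, the fit is unchanged by any monotone transformation of the responses, which is what makes the method robust to Cauchy errors and nonlinear link functions.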
Open Source Code | No | This procedure does not require any dedicated algorithm and can be executed using efficient implementations of Lasso in R (R Development Core Team, 2017) packages: lars (Efron et al., 2004) or glmnet (Friedman et al., 2010).
Open Datasets | Yes | The considered gene-expression data set was interrogated in lymphoblastoid cell lines of 210 unrelated HapMap individuals (The International HapMap Consortium, 2005) from four populations (60 Utah residents with ancestry from northern and western Europe, 45 Han Chinese in Beijing, 45 Japanese in Tokyo, 60 Yoruba in Ibadan, Nigeria) (Stranger et al., 2007). The data set can be found at ftp://ftp.sanger.ac.uk/pub/genevar/ and was previously studied e.g. in Bradic et al. (2011) and Fan et al. (2014).
Dataset Splits | Yes | Next, the data set is divided into two parts: the training set with 180 randomly selected individuals and the test set with the remaining 30 individuals.
Hardware Specification | No | We gratefully acknowledge the grant of the Wroclaw Center of Networking and Supercomputing (WCSS), where most of the computations were performed.
Software Dependencies | Yes | This procedure does not require any dedicated algorithm and can be executed using efficient implementations of Lasso in R (R Development Core Team, 2017) packages: lars (Efron et al., 2004) or glmnet (Friedman et al., 2010).
Experiment Setup | Yes | More specifically, we consider the following pairs $(n, p)$: $(100, 100)$, $(200, 400)$, $(300, 900)$, $(400, 1600)$. For each of these combinations we consider three different values of the sparsity parameter $p_0 = \#\{j : \beta_j \neq 0\} \in \{3, 10, 20\}$. In three of our simulation scenarios the rows of the design matrix are generated as independent random vectors from the multivariate normal distribution with the covariance matrix $\Sigma$ defined as follows: for the independent case $\Sigma = I$; for the correlated case $\Sigma_{ii} = 1$ and $\Sigma_{ij} = 0.3$ for $i \neq j$. In one of the scenarios the design matrix is created by simulating the genotypes of $p$ independent Single Nucleotide Polymorphisms (SNPs). In this case the explanatory variables can take only three values: 0 for the homozygote for the minor allele (genotype {a,a}), 1 for the heterozygote (genotype {a,A}) and 2 for the homozygote for the major allele (genotype {A,A}). The frequencies of the minor allele for each SNP are independently drawn from the uniform distribution on the interval $(0.1, 0.5)$. Then, given the frequency $\pi_j$ for the $j$-th SNP, the explanatory variable $X_{ij}$ has the distribution $P(X_{ij} = 0) = \pi_j^2$, $P(X_{ij} = 1) = 2\pi_j(1 - \pi_j)$ and $P(X_{ij} = 2) = (1 - \pi_j)^2$. The full description of the simulation scenarios is provided below. Scenario 1: $Y = X\beta + \varepsilon$, where the matrix $X$ is generated according to the independent case, $\beta_1 = \ldots = \beta_{p_0} = 3$ and the elements of $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ are independently drawn from the standard Cauchy distribution. We compare the quality of the above methods by performing 200 replicates of the experiment, where in each replicate we generate a new realization of the design matrix $X$ and the vector of random noise $\varepsilon$. We start with preparing the data set using three pre-processing steps as in Wang et al. (2012): we remove each probe for which the maximum expression among 210 individuals is smaller than the 25th percentile of the entire expression values, we remove any probe for which the range of the expression among 210 individuals is smaller than 2, and finally we select 300 genes whose expressions are the most correlated with the expression level of the analyzed gene. Next, the data set is divided into two parts: the training set with 180 randomly selected individuals and the test set with the remaining 30 individuals.
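The SNP design-matrix scenario quoted above can be reproduced in a few lines. The sketch below is an illustrative reconstruction (not the authors' code): it notes that the three genotype probabilities $\pi_j^2$, $2\pi_j(1-\pi_j)$, $(1-\pi_j)^2$ are exactly those of a Binomial$(2, 1-\pi_j)$ count of major alleles.

```python
import numpy as np

def simulate_snp_design(n, p, rng):
    """Simulate an n-by-p SNP design matrix as described in the quote:
    minor-allele frequencies pi_j ~ Uniform(0.1, 0.5); genotypes coded
    0 ({a,a}), 1 ({a,A}), 2 ({A,A}) with probabilities
    pi^2, 2*pi*(1-pi), (1-pi)^2 respectively."""
    pi = rng.uniform(0.1, 0.5, size=p)  # minor allele frequency per SNP
    # Counting major alleles across two independent draws gives exactly
    # the three genotype probabilities above: Binomial(2, 1 - pi_j).
    X = rng.binomial(2, 1.0 - pi, size=(n, p))
    return X, pi

# usage: the largest (n, p) pair considered in the quoted setup
rng = np.random.default_rng(1)
X, pi = simulate_snp_design(400, 1600, rng)
print(X.shape, sorted(np.unique(X)))
```

Per-SNP frequencies broadcast across rows, so each column has its own genotype distribution while rows remain independent individuals, matching the quoted setup.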