The Constrained Dantzig Selector with Enhanced Consistency
Authors: Yinfei Kong, Zemin Zheng, Jinchi Lv
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical studies confirm that the sample size needed for a certain level of accuracy in these problems can be much reduced. In Section 4, we discuss the implementation of the method and present several simulation and real data examples. Figure 1 presents the probabilities of exact recovery of sparse β0 based on 100 simulations by all methods. Table 1 summarizes the comparison results by all methods. We applied the same methods as in Section 4.2 to two real data sets: one real PCR data set and another gene expression data set, both in the high-dimensional setting with relatively small sample size. |
| Researcher Affiliation | Academia | Yinfei Kong EMAIL Department of Information Systems and Decision Sciences Mihaylo College of Business and Economics California State University at Fullerton Fullerton, CA 92831, USA; Zemin Zheng EMAIL Department of Statistics and Finance University of Science and Technology of China Hefei, Anhui 230026, China; Jinchi Lv EMAIL Data Sciences and Operations Department Marshall School of Business University of Southern California Los Angeles, CA 90089, USA |
| Pseudocode | Yes | We name this algorithm the CDS algorithm, which is detailed in four steps below. 1. For a fixed λ1 in the grid, denote by β̂(0) the initial value. Let β̂(0) be zero when λ1 = ‖n⁻¹Xᵀy‖∞, and the estimate from the previous λ1 in the grid otherwise. 2. Denote by β̂(k) the estimate from the kth iteration. Define the active set A as the support of β̂(k) and Aᶜ its complement. Let b be a vector with constant components λ0 on A and λ1 on Aᶜ. For the (k+1)th iteration, update A as A ∪ {j ∈ Aᶜ : |n⁻¹xⱼᵀ(y − Xβ̂_A)| > λ1}, where the subscript A indicates a subvector restricted to A. Solve the following linear program on the new set A: β̂_A = argmin ‖β_A‖1 subject to |n⁻¹X_Aᵀ(y − X_Aβ_A)| ≤ b_A (5), where ≤ is understood as componentwise no larger than and the subscript A also indicates a submatrix with columns corresponding to A. For the solution obtained in (5), set all its components smaller than λ in magnitude to zero. 3. Update the active set A as the support of β̂_A. Solve the Dantzig selector problem on this active set with λ0 as the regularization parameter: β̂_A = argmin ‖β_A‖1 subject to ‖n⁻¹X_Aᵀ(y − X_Aβ_A)‖∞ ≤ λ0 (6). Let β̂(k+1)_A = β̂_A and β̂(k+1)_Aᶜ = 0, which give the solution for the (k+1)th iteration. 4. Repeat steps 2 and 3 until convergence for a fixed λ1 and record the estimate from the last iteration as β̂_λ1. Jump to the next λ1 if β̂_λ1 ∈ B_λ, and stop the algorithm otherwise. |
| Open Source Code | No | The paper describes an algorithm for the constrained Dantzig selector but does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | The real PCR data set, originally studied in Lan et al. (2006), examines the genetics of two inbred mouse populations. This data set is comprised of n = 60 samples with 29 males and 31 females. Expression levels of 22,575 genes were measured. Following Song and Liang (2015), we study the linear relationship between the numbers of Phosphoenolpyruvate carboxykinase (PEPCK), a phenotype measured by quantitative real-time PCR, and the gene expression levels. ... The second data set has been studied in Scheetz et al. (2006) and Huang et al. (2008). In this data set, 120 twelve-week-old male rats were selected for tissue harvesting from the eyes and for microarray analysis. |
| Dataset Splits | Yes | The data set was randomly split into a training set of 55 samples and a test set with the remaining 5 samples for 100 times. ... The training set contains 100 samples and was sampled randomly 100 times from the full data set. The remaining 20 samples at each time served as the test set. |
| Hardware Specification | No | The paper mentions numerical studies and simulations but does not specify any particular hardware used for running these experiments. |
| Software Dependencies | No | The paper does not explicitly list any software dependencies with specific version numbers, such as programming languages, libraries, or computational frameworks used for the experiments. |
| Experiment Setup | Yes | We suggest some fixed values for λ0 and λ to simplify the computation, since the proposed method is generally not that sensitive to λ0 and λ as long as they fall in certain ranges. In simulation studies to be presented, a value around {(log p)/n}1/2 for λ and a smaller value for λ0, say 0.05{(log p)/n}1/2 or 0.1{(log p)/n}1/2, can provide us nice prediction and estimation results. ... Let λ0 and λ be in two small grids {0.001, 0.005, 0.01, 0.05, 0.1} and {0.05, 0.1, 0.15, 0.2}, respectively. ... We set the grid of values for λ1 as described in Section 4.1. ... We set λ0 = 0.01 and λ = 0.2 for our method. ... The tuning parameter λ1 was chosen by cross-validation, similar as in Example 2. ... We set λ0 = 0.001 and λ = 0.02 in this real data analysis as well as the other one below for conservativeness. |
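Since the paper releases no code, the four-step CDS pseudocode above can be sketched directly: each of the constrained problems (5) and (6) is a linear program, obtained by splitting β = u − v with u, v ≥ 0 and turning the componentwise bound on the residual correlations into two sets of inequalities. The sketch below is a minimal, unofficial illustration assuming NumPy/SciPy; the function names (`dantzig_lp`, `cds`) and default parameter values echo the reported settings (λ0 = 0.01, λ = 0.2) but are otherwise my own, and it is not the authors' implementation.

```python
# Minimal sketch of the CDS algorithm's inner linear programs (Eqs. 5-6)
# and the outer active-set loop (steps 2-4), for one fixed lambda_1.
# Illustrative only; not the authors' code.
import numpy as np
from scipy.optimize import linprog

def dantzig_lp(X_A, y, b):
    """Solve min ||beta_A||_1 s.t. |n^{-1} X_A^T (y - X_A beta_A)| <= b
    (componentwise), via the split beta = u - v with u, v >= 0."""
    n, m = X_A.shape
    G = X_A.T @ X_A / n                # n^{-1} X_A^T X_A
    c = X_A.T @ y / n                  # n^{-1} X_A^T y
    b = np.broadcast_to(b, m)
    # |c - G(u - v)| <= b  becomes  G(u-v) <= c + b  and  -G(u-v) <= b - c
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([c + b, b - c])
    res = linprog(np.ones(2 * m), A_ub=A_ub, b_ub=b_ub,
                  bounds=(0, None), method="highs")
    u, v = res.x[:m], res.x[m:]
    return u - v

def cds(X, y, lam1, lam0=0.01, lam=0.2, max_iter=50):
    """Steps 2-4 of the CDS algorithm for one lambda_1, zero initial value."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        # Step 2: augment the active set with constraints violated at lambda_1
        resid_corr = np.abs(X.T @ (y - X @ beta)) / n
        A = np.flatnonzero((beta != 0) | (resid_corr > lam1))
        if A.size == 0:
            break
        # bound lambda_0 on the old active set, lambda_1 on newcomers
        b_vec = np.where(beta[A] != 0, lam0, lam1)
        beta_A = dantzig_lp(X[:, A], y, b_vec)
        beta_A[np.abs(beta_A) < lam] = 0.0     # hard-threshold at lambda
        # Step 3: Dantzig selector refit on the support with lambda_0
        supp = A[beta_A != 0]
        new_beta = np.zeros(p)
        if supp.size:
            new_beta[supp] = dantzig_lp(X[:, supp], y, lam0)
        if np.allclose(new_beta, beta):        # Step 4: stop at convergence
            break
        beta = new_beta
    return beta
```

In line with the reported setup, λ1 would be tuned over a grid (chosen by cross-validation in the paper's examples) with λ of order {(log p)/n}^{1/2}, e.g. `lam1 = np.sqrt(np.log(p) / n)` for a quick sanity run.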