On Semi-Supervised Linear Regression in Covariate Shift Problems

Authors: Kenneth Joseph Ryan, Mark Vere Culp

JMLR 2015

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Performance is validated on simulated and real data. Keywords: joint optimization, semi-supervised regression, usefulness of unlabeled data ... The geometry helps articulate realistic assumptions for the theoretical risk results in Section 5, and the theoretical risk results help define informative simulations and real data tests in Section 6. In addition, the simulations and real data applications validate the theoretical risk results.

Researcher Affiliation | Academia | Kenneth Joseph Ryan EMAIL Mark Vere Culp EMAIL Department of Statistics, West Virginia University, Morgantown, WV 26506, USA

Pseudocode | No | The paper includes mathematical formulations and theoretical derivations, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks.

Open Source Code | No | The Elastic Net Optimization Problem (7) is convex and can be solved quickly by the glmnet package in R (Friedman et al., 2010; R Core Team, 2015), so this helps make our semi-supervised adjustment computationally viable.

Open Datasets | Yes | The 10 tests listed in Table 4 were constructed using 8 publicly available data sets and a simulated toy extrapolation data set. Each is expected to have a covariate-shifted empirical feature data distribution either because the characteristic used to define the labeled set is associated with other variables in the model matrix, because of the curse of dimensionality, or because the simulated toy data were generated from a model with covariate shift. ... Table 4: These ten covariate shift tests are used to establish benchmarks in Table 5. Data Set (n, p), Source: Toy Cov. Shift (1200, 1), Sugiyama et al. (2007); Auto-MPG (398, 8), Lichman (2013); Eye (120, 200), Rats 1-30 Express, Scheetz et al. (2006); Ethanol (589, 1037), Sols. 1-294 Ethanol, Shen et al. (2013).

Dataset Splits | Yes | For K-fold cross-validation in the semi-supervised setting, the L cases were partitioned into K folds, {L_k}_{k=1}^K. ... The JT-ENET estimate β̂_{γ̂,λ̂} minimized σ̂²_3 over the grid for λ1/(λ1 + 2λ2), γ1, and γ2. ... This particular implementation is optimized for estimating λ1 + 2λ2 with 10-fold cross-validation given λ1/(λ1 + 2λ2).

Hardware Specification | Yes | Cross-validation took an average of 3.5 minutes per data set on a 2.6 GHz Intel Core i7 Power Mac. ... The JT-ENET fit fairly quickly on a 2.6 GHz Intel Core i7 Power Mac.

Software Dependencies | Yes | The Elastic Net Optimization Problem (7) is convex and can be solved quickly by the glmnet package in R (Friedman et al., 2010; R Core Team, 2015), so this helps make our semi-supervised adjustment computationally viable. ... The caret package in R (Kuhn, 2008) was also used to fit the SVM with a polynomial kernel on the real data examples.

Experiment Setup | Yes | This particular implementation is optimized for estimating λ1 + 2λ2 with 10-fold cross-validation given λ1/(λ1 + 2λ2). First, the supervised elastic net was implemented by varying λ1/(λ1 + 2λ2) ∈ [0, 1] over an equally spaced grid of length 57 to optimize parameters λ. Second, the semi-supervised JT-ENET was implemented by estimating its parameters (λ, γ) simultaneously. ... Parameter λ1/(λ1 + 2λ2) was optimized over the grid {0, 0.25, 0.5, 0.75, 1, â}, where â was the optimal supervised setting for this parameter. Fixed grids γ1 ∈ ν⁻¹ and γ2 ∈ ν were used for the other parameters, where ν = {0.1, 0.5, 1, 10, 100, 1000, 10000, ∞} and ν⁻¹ = {1/r : r ∈ ν}.
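The tuning protocol quoted in the Dataset Splits and Experiment Setup rows (partition the labeled cases into K folds, then minimize a CV criterion over the λ1/(λ1 + 2λ2) ratio grid and the fixed γ grids) can be sketched as follows. This is an illustrative Python outline only, not the authors' R/glmnet implementation: the names kfold_indices, cv_error, grid_search, and the fit_and_score callback (a stand-in for fitting JT-ENET and evaluating the σ̂²_3 criterion on a held-out fold) are all hypothetical.

```python
import itertools
import math
import random
from statistics import mean

# Grids quoted in the Experiment Setup row; math.inf plays the role of the
# limiting gamma setting in nu.
nu = [0.1, 0.5, 1, 10, 100, 1000, 10000, math.inf]
nu_inv = [1 / r for r in nu]                 # nu^{-1} = {1/r : r in nu}
ratio_grid = [0.0, 0.25, 0.5, 0.75, 1.0]     # lambda1 / (lambda1 + 2*lambda2)


def kfold_indices(n_labeled, k=10, seed=0):
    """Partition the labeled cases L into K folds {L_k}."""
    idx = list(range(n_labeled))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_error(ratio, g1, g2, folds, fit_and_score):
    """Average held-out error over the K folds for one grid point."""
    return mean(fit_and_score(ratio, g1, g2, fold) for fold in folds)


def grid_search(folds, fit_and_score):
    """Pick (ratio, gamma1, gamma2) minimizing the CV criterion."""
    best = None
    for ratio, g1, g2 in itertools.product(ratio_grid, nu_inv, nu):
        err = cv_error(ratio, g1, g2, folds, fit_and_score)
        if best is None or err < best[0]:
            best = (err, ratio, g1, g2)
    return best  # (cv_error, ratio, gamma1, gamma2)
```

In the paper's actual procedure the inner fit at each grid point is the convex elastic net problem solved by glmnet, and the ratio grid is augmented with the optimal supervised setting â; the sketch above only shows the outer loop structure.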