The xyz algorithm for fast interaction search in high-dimensional data

Authors: Gian-Andrea Thanei, Nicolai Meinshausen, Rajen D. Shah

JMLR 2018

Reproducibility assessment. Each entry below gives the variable, the result, and the supporting LLM response (excerpts from the paper).
Research Type: Experimental
    "Section 5 contains a variety of numerical experiments on real and simulated data that complement our theoretical results and demonstrate the effectiveness of our proposal in practice." "5. Experiments: To test the algorithm and theory developed in the previous sections, we run a sequence of experiments on real and simulated data."
Researcher Affiliation: Academia
    Gian-Andrea Thanei (EMAIL), Seminar für Statistik, ETH Zürich, 8092 Zürich, Switzerland. Nicolai Meinshausen (EMAIL), Seminar für Statistik, ETH Zürich, 8092 Zürich, Switzerland. Rajen D. Shah (EMAIL), Statistical Laboratory, University of Cambridge, Cambridge, CB3 0WB, UK.
Pseudocode: Yes
    "Algorithm 1: A general form of the xyz algorithm." "Algorithm 2: Final version of the xyz algorithm."
Open Source Code: Yes
    "We provide implementations of both the core xyz algorithm and its extension to the Lasso in the R package xyz, which is available on GitHub (Thanei, 2016) and CRAN."
Open Datasets: Yes
    "We consider the LURIC data set (Winkelmann et al., 2001), which contains data on patients who were hospitalised for coronary angiography." "Riboflavin: The Riboflavin production data set (Bühlmann et al., 2014) contains n = 71 samples and p = 4088 predictors (gene expressions)." "Kemmeren: The Kemmeren (Kemmeren et al., 2014) data set records knockouts of p = 6170 genes." "Climate: The climate data set from the CNRM model from the CMIP5 model ensemble (Knutti et al., 2013) simulates the temperature of points on the northern hemisphere, which is recorded in X."
Dataset Splits: No
    The paper discusses performing experiments and calculating "normalized out-of-sample squared ℓ2 error" in Section 5.5, which implies data splitting. However, it does not specify exact split percentages, sample counts, or predefined splits for reproducibility. For example, it states "We subsample an increasing number of variables to vary the difficulty of the regression problem" but does not detail the train/test/validation split methodology.
Hardware Specification: Yes
    "We demonstrate its efficiency for application to genome-wide association studies, where more than 10^11 interactions can be screened in under 280 seconds with a single-core 1.2 GHz CPU."
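As a back-of-the-envelope check on the hardware claim, a GWAS-scale variable count on the order of p = 450,000 (an illustrative assumption; the excerpt does not state p) already yields more than 10^11 candidate interaction pairs:

```python
# Candidate interaction pairs for a hypothetical GWAS-scale p.
# p = 450_000 is an illustrative assumption, not a figure from the paper.
p = 450_000
pairs = p * (p - 1) // 2   # number of unordered variable pairs

# The paper reports screening >10^11 pairs in under 280 s on one core,
# i.e. a throughput of roughly a few hundred million pairs per second.
rate = pairs / 280
print(pairs, rate)
```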
Software Dependencies: No
    "We provide implementations of both the core xyz algorithm and its extension to the Lasso in the R package xyz, which is available on GitHub (Thanei, 2016) and CRAN. We also note that it is straightforward to use a different scaling for the penalty on the interaction coefficients in (17), which may be helpful in practice."
    The paper runs three procedures to estimate the main and interaction effects:
    - Two-stage Lasso: fit the Lasso to the data, then run the Lasso once more on an augmented design matrix containing interactions between all selected main effects. Complexity analysis of the Least Angle Regression (LARS) algorithm (Efron et al., 2004) suggests a computational cost of O(np min(n, p)), making the procedure very efficient. However, as the results show, it struggles in situations such as model 2, where a main-effects regression fails to select variables involved in strong interactions.
    - Lasso with all interactions: build the full interaction matrix and compute the standard Lasso on this augmented data matrix, using glmnet (Friedman et al., 2010). Analysis of the LARS algorithm suggests a computational complexity of order O(np^2 min(n, p^2)). Nevertheless, for small p, this approach is feasible.
    - xyz: Algorithm 3, with the parameter L set to p in order to target the strong interactions. The experiment (Figure 5) shows that xyz enjoys the favourable properties of both its competitors: it is as fast as the two-stage Lasso, with an almost linear run time in p, and about as accurate as the estimator computed by screening all pairs (brute force).
    Although the R packages xyz and glmnet are named, no version numbers or further software dependencies are specified.
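The two-stage Lasso baseline described above can be sketched as follows. This is a minimal illustration using scikit-learn's LassoCV rather than the paper's R/glmnet setup; the synthetic data, variable names, and cross-validated penalty choice are all assumptions:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
# Illustrative response: one strong main effect and one strong interaction.
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] * X[:, 2] + rng.standard_normal(n)

# Stage 1: Lasso on main effects only.
stage1 = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(stage1.coef_ != 0)

# Stage 2: augment the design with interactions between all selected
# main effects, then refit the Lasso on the augmented matrix.
inter_cols = [X[:, j] * X[:, k] for j, k in combinations(selected, 2)]
X_aug = np.column_stack([X] + inter_cols) if inter_cols else X
stage2 = LassoCV(cv=5).fit(X_aug, y)
print(X_aug.shape, stage2.coef_.shape)
```

Note the failure mode the paper points out: if a variable enters only through an interaction (as X[:, 1] and X[:, 2] do here, under model 2-style setups), stage 1 may never select it, and stage 2 then cannot recover the interaction.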
Experiment Setup: Yes
    "We consider three settings. For all three settings we have n = 1000. We let p ∈ {250, 500, 750, 1000}. Each row of X is generated i.i.d. as N(0, Σ). The magnitudes of both the main and interaction effects are chosen uniformly from the interval [2, 6] (20 main effects and 10 interaction effects) and we set ε_i ~ N(0, 1). For each experiment we fix the number of runs L to p, so the run time of xyz is O(np^1.5)."
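The simulation setup quoted above can be sketched as follows. This is a hedged illustration only: Σ = I, positive effect signs, and the random placement of the active variables are assumptions, since the excerpt does not specify them (the paper's three settings differ in Σ):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 250                     # n fixed at 1000; p ranges over {250, 500, 750, 1000}
X = rng.standard_normal((n, p))      # rows i.i.d. N(0, Σ); Σ = I assumed here

# 20 main effects and 10 interaction effects, magnitudes Uniform[2, 6].
main_idx = rng.choice(p, size=20, replace=False)
pair_idx = rng.choice(p, size=(10, 2), replace=False)  # 10 disjoint interaction pairs
beta = rng.uniform(2.0, 6.0, size=20)    # main-effect magnitudes
theta = rng.uniform(2.0, 6.0, size=10)   # interaction-effect magnitudes

inter = X[:, pair_idx[:, 0]] * X[:, pair_idx[:, 1]]
eps = rng.standard_normal(n)             # ε_i ~ N(0, 1)
y = X[:, main_idx] @ beta + inter @ theta + eps
print(y.shape)
```

With L fixed to p as in the quoted setup, the stated O(np^1.5) run time for xyz follows for each such simulated design.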