Multiple Output Regression with Latent Noise

Authors: Jussi Gillberg, Pekka Marttinen, Matti Pirinen, Antti J. Kangas, Pasi Soininen, Mehreen Ali, Aki S. Havulinna, Marjo-Riitta Järvelin, Mika Ala-Korpela, Samuel Kaski

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulations and prediction experiments with metabolite, gene expression, fMRI measurement, and macroeconomic time series data show that our model equals or exceeds the state-of-the-art performance and, in particular, outperforms the standard approach of assuming independent noise and signal models. Keywords: Bayesian reduced-rank regression, latent variable models, latent signal-to-noise ratio, multiple-output regression, nonparametric Bayes, shrinkage priors, structured noise, weak effects
Researcher Affiliation | Collaboration | Jussi Gillberg EMAIL Pekka Marttinen EMAIL Helsinki Institute for Information Technology HIIT Department of Computer Science PO Box 15600, Aalto University, 00076 Aalto, Finland; Matti Pirinen EMAIL Institute for Molecular Medicine Finland (FIMM) University of Helsinki, Finland; Antti J. Kangas EMAIL Pasi Soininen EMAIL Computational Medicine Faculty of Medicine University of Oulu & Biocenter Oulu, Oulu, Finland; Marjo-Riitta Järvelin EMAIL Department of Epidemiology and Biostatistics MRC-PHE Centre for Environment & Health, School of Public Health, Imperial College London, UK; Disclosure: AJK, PS and MAK are shareholders of Brainshake Ltd., a company offering NMR-based metabolite profiling.
Pseudocode | No | The paper describes inference methods using Gibbs sampling and discusses computational complexity but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures displaying structured steps.
Open Source Code | Yes | Code in R for the new method is available for download at http://research.cs.aalto.fi/pml/software/latentNoise/.
Open Datasets | Yes | NFBC1966 [N = 4702, P = 101, K = 96, metabolomics prediction from SNPs] The NFBC1966 data set comprises genome-wide SNP data along with metabolomics measurements for a cohort of 4,702 individuals (Rantakallio, 1969; Soininen et al., 2009).; DILGOM [N = 509, P = 65, K = 18 ... 137, metabolomics and gene expression prediction from SNPs] The DILGOM data set (Inouye et al., 2010) consists of genome-wide SNP data along with metabolomics and gene expression measurements.; fMRI [N = 1307, P = 776, K = 250, fMRI response prediction from text stimuli] The cognitive neuroscience data set (Wehbe et al., 2014) consists of a time series of fMRI measurements from 8 subjects reading a chapter from Harry Potter and the Sorcerer's Stone using Rapid Serial Visual Presentation: words of the text are presented one by one in the center of a screen.; econ [N = 120, P = 52, K = 52, macroeconomic time series prediction] The macroeconomic time series data set (Stock and Watson, 2006) consists of monthly values of 52 macroeconomic indicators.
Dataset Splits | Yes | For this data set, the comparison method GFlasso required excessive training time and we used 5-fold cross-validation to evaluate test set performances. Where cross-validation was needed for selecting model parameter values, the validation data performance was measured as an average over 3 validation sets, each comprising 1/10 of the training samples.; On these data sets, 10-fold cross-validation was used to evaluate test set performances. To select values of the parameters that required evaluation on validation data, the training data was then further divided into 9 folds, on which cross-validation was performed to select parameters according to averaged validation set performance.
Hardware Specification | No | No specific hardware details (like CPU/GPU models or memory) are mentioned in the paper. The acknowledgments section refers to 'computational resources provided by the Aalto Science-IT project', which is too general to count as a specific hardware specification.
Software Dependencies | No | The paper mentions implementing code in R and using packages like 'glmnet' and 'PEER software' but does not specify any version numbers for R or the libraries used. Without version information, the software environment cannot be reproduced exactly.
Experiment Setup | Yes | Hyperparameters a1 and a2 of all the BRRR models were fixed to 10 and 4, respectively. In total 1,000 MCMC samples were generated and 500 were discarded as burn-in. The remaining samples, thinned by a factor of 10, were used for prediction.; With the NFBC1966 data, the latent signal-to-noise ratio β was selected using cross-validation from a range of values from 100 to 1/100, β = {100, 10, 2, 1, 1/60, 1/100}, in order to thoroughly evaluate the sensitivity of the model to this parameter.; The mixture parameter α controlling the balance between L1 and L2 regularization was evaluated on the grid [0, 0.1, ..., 0.9, 1.0] and selected using 10-fold cross-validation.; Kernel ridge regression was regularized according to the standard approach of adding parameter λ to the diagonal elements of the kernel. The value of λ was selected using cross-validation from a set of 10 values ranging from 0.1 to 100, [10^-1, 10^-0.66, ..., 10^1.67, 10^2].
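
The Dataset Splits and Experiment Setup entries above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the authors' R code: the function names are hypothetical, it assumes thinning starts at the first post-burn-in draw, it assumes the 10 values of λ are evenly spaced in log10, and it assumes the folds are formed by random shuffling. The size n = 120 is used only as an example.

```python
import random


def retained_sample_indices(n_samples=1000, n_burn_in=500, thin=10):
    """Indices of MCMC draws kept after burn-in and thinning.

    Assumption: thinning starts at the first post-burn-in draw.
    """
    return list(range(n_burn_in, n_samples, thin))


def log_grid(low_exp, high_exp, n_points):
    """Values 10**e for n_points exponents evenly spaced in [low_exp, high_exp]."""
    step = (high_exp - low_exp) / (n_points - 1)
    return [10 ** (low_exp + i * step) for i in range(n_points)]


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_splits(n, outer_k=10, inner_k=9, seed=0):
    """Yield (test, inner_splits) for each outer fold.

    inner_splits is a list of (fit, validation) index lists drawn from
    the outer training data only, so test samples never reach tuning.
    """
    outer = k_fold_indices(n, outer_k, seed)
    for i, test in enumerate(outer):
        train = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner_splits = []
        for val_pos in k_fold_indices(len(train), inner_k, seed + 1):
            val = [train[p] for p in val_pos]
            val_set = set(val)
            fit = [j for j in train if j not in val_set]
            inner_splits.append((fit, val))
        yield test, inner_splits


# Grids described in the Experiment Setup entry.
alpha_grid = [round(0.1 * i, 1) for i in range(11)]  # 0, 0.1, ..., 1.0
lambda_grid = log_grid(-1, 2, 10)                    # 10^-1 ... 10^2
kept = retained_sample_indices()                     # draws used for prediction

if __name__ == "__main__":
    print(len(kept), "posterior draws retained")
    print("lambda grid:", [round(v, 3) for v in lambda_grid])
    for test, inner in nested_splits(120):
        print(len(test), "test samples;", len(inner), "inner folds")
```

Under these assumptions, 1,000 draws with 500 discarded and a thinning factor of 10 leave 50 draws for prediction, and the evenly spaced exponents reproduce the quoted endpoints 10^-1 and 10^2 with 10^1.67 as the ninth value.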