Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data

Authors: Shuo-Chieh Huang, Ruey S. Tsay

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from the 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices. To validate the performance of TSRGA, we apply it to both synthetic and real-world data sets and show that TSRGA converges much faster than other existing methods. In the simulation experiments, TSRGA achieved the smallest estimation error using the least number of iterations.
Researcher Affiliation | Academia | Shuo-Chieh Huang and Ruey S. Tsay, Booth School of Business, University of Chicago, Chicago, IL 60637, USA
Pseudocode | Yes | Algorithm 1: Feature-distributed relaxed greedy algorithm (RGA); Algorithm 2: Feature-distributed second-stage RGA
Open Source Code | No | The paper mentions third-party tools such as 'Open MPI', 'mpi4py', and the 'glmnet package in R', but it gives neither a link to nor an availability statement for the source code of the TSRGA method it describes.
Open Datasets | Yes | All series are obtained from Yahoo! Finance via the tidyquant package in R. The corpus utilized in this application is sourced from the EDGAR-CORPUS, originally prepared by Loukas et al. (2021).
Dataset Splits | Yes | As a benchmark, we also solve the Lasso problem with 5-fold cross validation using glmnet package in R. For TSRGA, we simply set L_n = 500 and t_n = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, p_n) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, L_n is set to 105, and we hold one third of the training data as validation set to select the tuning parameter t_n for TSRGA over a grid of values. We reserved the last year of data [for the test set].
Hardware Specification | Yes | The algorithm runs on the high-performance computing cluster of the university, which comprises multiple computing nodes equipped with Intel Xeon Gold 6248R processors.
Software Dependencies | No | The paper mentions 'Open MPI and the Python binding mpi4py (Dalcín et al., 2005; Dalcín and Fang, 2021)', 'glmnet package in R', and 'gensim package in Python'. However, it does not give version numbers for any of these software components.
Experiment Setup | Yes | The step size of the Hydra-type algorithms is set to the lowest value so that we observe convergence of the algorithms instead of divergence. For TSRGA, we simply set L_n = 500 and t_n = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, p_n) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, L_n is set to 105, and we hold one third of the training data as validation set to select the tuning parameter t_n for TSRGA over a grid of values. t_n is selected among t = (0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30)/log n.
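
The tuning procedure quoted in the Dataset Splits and Experiment Setup rows can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code, which the paper does not release. The function names, the use of the natural logarithm for "log n", the choice of the training sample size as n, and the choice of the final block of rows as the validation third are all assumptions. The `validation_error` callable is a hypothetical stand-in for fitting TSRGA and scoring it on the held-out rows.

```python
import math

# Numerators of the tuning grid, copied as quoted in the Experiment Setup row.
GRID_NUMERATORS = (0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30)


def default_tn(n):
    """Simulation setting quoted in the table: t_n = 1 / (10 log n).

    Assumption: "log" is the natural logarithm.
    """
    return 1.0 / (10.0 * math.log(n))


def tn_grid(n):
    """Candidate t_n values: each quoted numerator divided by log n."""
    return [c / math.log(n) for c in GRID_NUMERATORS]


def holdout_split(n_train):
    """Index ranges that hold out one third of the training rows.

    Assumption: the held-out third is the final block of rows. The quoted
    text does not say whether the split is chronological or random.
    """
    n_val = n_train // 3
    cut = n_train - n_val
    return range(0, cut), range(cut, n_train)


def select_tn(n_train, validation_error):
    """Return the grid value with the smallest validation error.

    `validation_error(t_n, fit_idx, val_idx)` is a hypothetical callable:
    it should fit the model on `fit_idx` with tuning parameter `t_n` and
    return a loss measured on `val_idx`.
    """
    fit_idx, val_idx = holdout_split(n_train)
    return min(
        tn_grid(n_train),  # assumption: n is the full training sample size
        key=lambda t: validation_error(t, fit_idx, val_idx),
    )
```

A caller would supply its own fitting routine as `validation_error`; any function with that three-argument shape works, so the selection logic stays independent of how the model is estimated.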