Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
Authors: Shuo-Chieh Huang, Ruey S. Tsay
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The fast convergence of TSRGA is validated by simulation experiments; in those experiments, TSRGA achieved the smallest estimation error using the fewest iterations. To validate the performance of TSRGA, we apply it to both synthetic and real-world data sets and show that TSRGA converges much faster than other existing methods. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices. |
| Researcher Affiliation | Academia | Shuo-Chieh Huang (EMAIL) and Ruey S. Tsay (EMAIL), Booth School of Business, University of Chicago, Chicago, IL 60637, USA |
| Pseudocode | Yes | Algorithm 1: Feature-distributed relaxed greedy algorithm (RGA) Algorithm 2: Feature-distributed second-stage RGA |
| Open Source Code | No | The paper mentions third-party tools such as 'Open MPI', 'mpi4py', and the 'glmnet package in R', but provides neither access to, nor a statement about, source code for the TSRGA methodology itself. |
| Open Datasets | Yes | All series are obtained from Yahoo! Finance via the tidyquant package in R. The corpus utilized in this application is sourced from the EDGAR-CORPUS, originally prepared by Loukas et al. (2021). |
| Dataset Splits | Yes | As a benchmark, we also solve the Lasso problem with 5-fold cross-validation using the glmnet package in R. For TSRGA, we simply set Ln = 500 and tn = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, pn) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, Ln is set to 10^5, and we hold out one third of the training data as a validation set to select the tuning parameter tn over a grid of values. We reserved the last year of data [for the test set]. |
| Hardware Specification | Yes | The algorithm runs on the high-performance computing cluster of the university, which comprises multiple computing nodes equipped with Intel Xeon Gold 6248R processors. |
| Software Dependencies | No | The paper mentions 'Open MPI and the Python binding mpi4py (Dalcín et al., 2005; Dalcín and Fang, 2021)', the 'glmnet package in R', and the 'gensim package in Python'. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | The step size of the Hydra-type algorithms is set to the lowest value at which we observe convergence rather than divergence. For TSRGA, we simply set Ln = 500 and tn = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, pn) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, Ln is set to 10^5, and we hold out one third of the training data as a validation set to select the tuning parameter tn over a grid of values. tn is selected among t = (0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30)/ log n. |
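The tuning procedure quoted above (hold out one third of the training data as a validation set, then pick tn from a grid of values divided by log n) can be sketched as follows. This is a minimal illustration, not the paper's implementation: TSRGA itself is not publicly released, so a simple ridge-style fit stands in for the actual estimator, and the data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data; the paper's TSRGA operates on
# feature-distributed matrices, which we do not reproduce here.
n, p = 900, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)

# Hold out one third of the training data as a validation set,
# as described for selecting t_n.
n_train = n - n // 3
X_tr, y_tr = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:], y[n_train:]

# Candidate grid for t_n, scaled by 1 / log(n) as in the paper.
t_grid = np.array(
    [0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30]
) / np.log(n)

def fit_and_score(t):
    """Hypothetical stand-in fit: a ridge solution whose shrinkage
    depends on t. The real procedure would run TSRGA with t_n = t
    and report validation error."""
    lam = 1.0 / t
    coef = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ y_tr)
    resid = y_val - X_val @ coef
    return float(np.mean(resid ** 2))

val_errors = [fit_and_score(t) for t in t_grid]
best_t = float(t_grid[int(np.argmin(val_errors))])
print(f"selected t_n = {best_t:.4f}")
```

The point of the sketch is only the selection loop: each candidate tn is evaluated on the held-out third, and the grid value minimizing validation error is kept.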