Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
Authors: Shuo-Chieh Huang, Ruey S. Tsay
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The fast convergence of TSRGA is validated by simulation experiments; in those experiments, TSRGA achieved the smallest estimation error using the fewest iterations. To validate the performance of TSRGA, we apply it to both synthetic and real-world data sets and show that TSRGA converges much faster than other existing methods. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices. |
| Researcher Affiliation | Academia | Shuo-Chieh Huang (EMAIL) and Ruey S. Tsay (EMAIL), Booth School of Business, University of Chicago, Chicago, IL 60637, USA |
| Pseudocode | Yes | Algorithm 1: Feature-distributed relaxed greedy algorithm (RGA) Algorithm 2: Feature-distributed second-stage RGA |
| Open Source Code | No | The paper mentions third-party tools such as 'Open MPI', 'mpi4py', and the 'glmnet package in R', but provides neither access to, nor a statement about, source code for the TSRGA methodology itself. |
| Open Datasets | Yes | All series are obtained from Yahoo! Finance via the tidyquant package in R. The corpus utilized in this application is sourced from the EDGAR-CORPUS, originally prepared by Loukas et al. (2021). |
| Dataset Splits | Yes | As a benchmark, we also solve the Lasso problem with 5-fold cross-validation using the glmnet package in R. For TSRGA, we simply set Ln = 500 and tn = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, pn) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, Ln is set to 10^5, and we hold out one third of the training data as a validation set to select the tuning parameter tn over a grid of values. We reserved the last year of data [for the test set]. |
| Hardware Specification | Yes | The algorithm runs on the high-performance computing cluster of the university, which comprises multiple computing nodes equipped with Intel Xeon Gold 6248R processors. |
| Software Dependencies | No | The paper mentions 'Open MPI and the Python binding mpi4py (Dalcín et al., 2005; Dalcín and Fang, 2021)', the 'glmnet package in R', and the 'gensim package in Python'. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | The step size of the Hydra-type algorithms is set to the lowest value at which we observe convergence rather than divergence. For TSRGA, we simply set Ln = 500 and tn = 1/(10 log n), and the performance is not too sensitive to these choices. For Specifications 1 and 2 below, we consider three cases with (n, pn) ∈ {(800, 1200), (1200, 2000), (1500, 3000)}. For TSRGA, Ln is set to 10^5, and we hold out one third of the training data as a validation set to select the tuning parameter tn over a grid of values. tn is selected among t = (0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30)/ log n. |
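The tuning procedure quoted above (hold out one third of the training data as a validation set, then pick tn from a grid of values divided by log n) can be sketched as follows. This is a minimal illustration, not the paper's implementation: TSRGA itself is not publicly released, so a simple ridge-style fit stands in for the actual estimator, and the data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data; the paper's TSRGA operates on
# feature-distributed matrices, which we do not reproduce here.
n, p = 900, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)

# Hold out one third of the training data as a validation set,
# as described for selecting t_n.
n_train = n - n // 3
X_tr, y_tr = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:], y[n_train:]

# Candidate grid for t_n, scaled by 1 / log(n) as in the paper.
t_grid = np.array(
    [0.01, 0.07, 1.10, 1.39, 1.61, 1.79, 1.95, 2.08, 2.20, 2.30]
) / np.log(n)

def fit_and_score(t):
    """Hypothetical stand-in fit: a ridge solution whose shrinkage
    depends on t. The real procedure would run TSRGA with t_n = t
    and report validation error."""
    lam = 1.0 / t
    coef = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ y_tr)
    resid = y_val - X_val @ coef
    return float(np.mean(resid ** 2))

val_errors = [fit_and_score(t) for t in t_grid]
best_t = float(t_grid[int(np.argmin(val_errors))])
print(f"selected t_n = {best_t:.4f}")
```

The point of the sketch is only the selection loop: each candidate tn is evaluated on the held-out third, and the grid value minimizing validation error is kept.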