reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Least Squares Model Averaging for Distributed Data

Authors: Haili Zhang, Zhaobo Liu, Guohua Zou

JMLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Simulation results and a real airline data analysis illustrate that the proposed model averaging methods perform better than the commonly used model selection and model averaging methods in distributed data cases. Our approaches contribute to model averaging theory in distributed data and parallel computations, and can be applied in big data analysis to save time and reduce the computational burden.
Researcher Affiliation	Academia	Haili Zhang, Institute of Applied Mathematics, Shenzhen Polytechnic University, Shenzhen, 518055, China. Zhaobo Liu, Institute for Advanced Study, Shenzhen University, Shenzhen, 518060, China. Guohua Zou, School of Mathematical Sciences, Capital Normal University, Beijing, 100048, China.
Pseudocode	No	The paper describes the methodology using mathematical equations and text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper does not contain any explicit statements about releasing source code or provide links to code repositories for the methodology described.
Open Datasets	Yes	In this section, we use our proposed distributed model averaging methods to analyze the airline on-time performance data from the 2009 ASA Data Expo (http://stat-computing.org/dataexpo/2009/the-data.html). The data set is publicly available and has been used for demonstration with big data in many papers.
Dataset Splits	Yes	We use the ith subject data as training data to predict the late time at the (i + 1)th subject data, i = 1, 2, . . . , 123. For the ith subject, we apply simple random sampling scheme without replacement to the data and get K random samples, then we use our proposed distributed model averaging methods for data analysis.
Hardware Specification	No	The paper does not provide specific details about the hardware used for running the experiments or simulations.
Software Dependencies	No	The paper does not list specific software dependencies with version numbers.
Experiment Setup	Yes	For the distributed data, we set the sample size for each subject to be varied at n = 50, 150, 400, 1000, 5000 and 10000. The number of subjects is set as K = 1, 2, 3, 5 and 10. Let $p_S$ equal $[4n^{1/2}]_+$ ($[\cdot]_+$ means round to get an integer, and so $p_S$ = 28, 49, 80, 126, 283 and 400 for the above six sample sizes), and the number of candidate models $S$ be $\lceil n^{1/3} \rceil + 1$ ($\lceil\cdot\rceil$ means round up to get an integer, and so $S$ = 5, 7, 9, 11, 19 and 22 for the six sample sizes). All the candidate models are nested and the dimension for the $s$th candidate model is $1 + d(s - 1)$, where $d = (p_S - 1)/(S - 1)$ and $s = 1, 2, \ldots, S - 1$, while the dimension for the $S$th candidate model is $p_S$.
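
The arithmetic in the Experiment Setup row can be checked with a short Python sketch. It is not the authors' code. It assumes the full-model dimension p_S is 4·n^(1/2) rounded to the nearest integer and the number of candidate models S is the cube root of n rounded up, plus one. The function names are hypothetical, and flooring d to obtain integer dimensions is an assumption, since the quoted text does not say how a fractional d is handled. Under this reading the sketch reproduces every quoted p_S value and the first five S values; for n = 10000 it yields S = 23 rather than the quoted 22, so either the reading of the formula or the quoted value may be off.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with ties going up."""
    return math.floor(x + 0.5)


def ceil_cube_root(n: int) -> int:
    """Smallest integer r with r**3 >= n (integer check avoids float edge cases)."""
    r = round(n ** (1 / 3))
    while r ** 3 < n:
        r += 1
    while r > 0 and (r - 1) ** 3 >= n:
        r -= 1
    return r


def full_dimension(n: int) -> int:
    """p_S: dimension of the largest candidate model, read as round(4 * sqrt(n))."""
    return round_half_up(4 * math.sqrt(n))


def num_candidates(n: int) -> int:
    """S: number of nested candidate models, read as ceil(n^(1/3)) + 1."""
    return ceil_cube_root(n) + 1


def nested_dimensions(n: int) -> list[int]:
    """Dimensions of the S nested candidate models for sample size n."""
    p_s = full_dimension(n)
    s_total = num_candidates(n)
    # ASSUMPTION: d is floored to an integer; the source does not state this.
    d = (p_s - 1) // (s_total - 1)
    dims = [1 + d * (s - 1) for s in range(1, s_total)]  # models 1 .. S-1
    dims.append(p_s)  # the S-th model uses all p_S regressors
    return dims


if __name__ == "__main__":
    for n in (50, 150, 400, 1000, 5000, 10000):
        print(n, full_dimension(n), num_candidates(n), nested_dimensions(n))
```

For n = 50, for example, the sketch gives p_S = 28, S = 5 and, under the floor assumption, nested model dimensions 1, 7, 13, 19 and 28.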