Least Squares Model Averaging for Distributed Data

Authors: Haili Zhang, Zhaobo Liu, Guohua Zou

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulation results and a real airline data analysis illustrate that the proposed model averaging methods perform better than the commonly used model selection and model averaging methods in distributed data cases. Our approaches contribute to model averaging theory in distributed data and parallel computations, and can be applied in big data analysis to save time and reduce the computational burden.
Researcher Affiliation | Academia | Haili Zhang, Institute of Applied Mathematics, Shenzhen Polytechnic University, Shenzhen, 518055, China. Zhaobo Liu, Institute for Advanced Study, Shenzhen University, Shenzhen, 518060, China. Guohua Zou, School of Mathematical Sciences, Capital Normal University, Beijing, 100048, China.
Pseudocode | No | The paper describes the methodology using mathematical equations and text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to code repositories for the methodology described.
Open Datasets | Yes | In this section, we use our proposed distributed model averaging methods to analyze the airline on-time performance data from the 2009 ASA Data Expo (http://stat-computing.org/dataexpo/2009/the-data.html). The data set is publicly available and has been used for demonstration with big data in many papers.
Dataset Splits | Yes | We use the ith subject data as training data to predict the late time at the (i + 1)th subject data, i = 1, 2, ..., 123. For the ith subject, we apply simple random sampling scheme without replacement to the data and get K random samples, then we use our proposed distributed model averaging methods for data analysis.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments or simulations.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | For the distributed data, we set the sample size for each subject to be varied at n = 50, 150, 400, 1000, 5000 and 10000. The number of subjects is set as K = 1, 2, 3, 5 and 10. Let $p_S$ be equal to $[4n^{1/2}]_+$ ($[\cdot]_+$ means round to get an integer, and so $p_S$ = 28, 49, 80, 126, 283 and 400 for the above six sample sizes), and the number of candidate models $S$ be $\lceil n^{1/3} \rceil + 1$ ($\lceil \cdot \rceil$ means round up to get an integer, and so $S$ = 5, 7, 9, 11, 19 and 22 for the six sample sizes). All the candidate models are nested and the dimension for the $s$th candidate model is $1 + d(s-1)$, where $d = (p_S - 1)/(S - 1)$ and $s = 1, 2, \ldots, S-1$, while the dimension for the $S$th candidate model is $p_S$.
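
The arithmetic in the Experiment Setup row can be checked with a short script. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the floor applied to d(s - 1) is an assumption, since the quoted text does not state how non-integer dimensions are rounded. Reading the formulas as "4 times the square root of n, rounded to the nearest integer" and "the cube root of n, rounded up, plus one" reproduces every listed value except one: for n = 10000 this reading gives S = 23, while the quoted text lists 22.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)


def ceil_cube_root(n: int) -> int:
    """Smallest integer m with m**3 >= n, corrected with integer arithmetic."""
    m = round(n ** (1.0 / 3.0))
    while m ** 3 < n:
        m += 1
    while m > 0 and (m - 1) ** 3 >= n:
        m -= 1
    return m


def largest_dimension(n: int) -> int:
    """p_S: 4 * sqrt(n), rounded to the nearest integer."""
    return round_half_up(4.0 * math.sqrt(n))


def num_candidate_models(n: int) -> int:
    """S: cube root of n rounded up, plus one."""
    return ceil_cube_root(n) + 1


def nested_dimensions(p_s: int, num_models: int) -> list:
    """Dimensions of the nested candidate models.

    Model s (s = 1, ..., S-1) has dimension 1 + d*(s-1) with
    d = (p_S - 1)/(S - 1); the last model has dimension p_S.
    ASSUMPTION: d*(s-1) is floored to an integer.
    """
    dims = [1 + ((p_s - 1) * (s - 1)) // (num_models - 1)
            for s in range(1, num_models)]
    dims.append(p_s)
    return dims


if __name__ == "__main__":
    for n in (50, 150, 400, 1000, 5000, 10000):
        p_s = largest_dimension(n)
        s_count = num_candidate_models(n)
        print(n, p_s, s_count, nested_dimensions(p_s, s_count))
```

Integer arithmetic is used for the cube root and for the dimension step so that exact cases such as n = 1000 are not affected by floating-point error.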