Distributed Statistical Inference under Heterogeneity

Authors: Jia Gu, Song Xi Chen

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The purpose of this section is to examine the numerical performance of the estimators via both the simulation study in Section 6.1 and the real data analysis in Section 6.2.
Researcher Affiliation | Academia | Jia Gu EMAIL Center for Statistical Science, Peking University, Beijing, China; Song Xi Chen EMAIL School of Mathematical Science, Guanghua School of Management and Center for Statistical Science, Peking University, Beijing, China
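Pseudocode | Yes | The procedure to obtain the weighted distributed estimator is summarized in Algorithm 1.
Algorithm 1: Weighted Distributed estimator
Input: Distributed datasets: $\{X_{k,i},\ k = 1, \ldots, K;\ i = 1, \ldots, n_k\}$
Output: Weighted distributed estimator: $\hat{\phi}^{WD}$
1 In each data block $k$ ($k = 1, 2, \ldots, K$):
2   Solve (2) and obtain $\hat{\theta}_k = (\hat{\phi}_k^{T}, \hat{\lambda}_k^{T})^{T}$;
3   Calculate $\hat{H}_k(\hat{\theta}_k)$, which is the leading principal sub-matrix of order $p_1$ of $(\nabla_{\theta_k}\hat{\Psi}_{\theta_k})^{-1}\big(n_k^{-1}\sum_{i=1}^{n_k} Z(X_{k,i}; \hat{\theta}_k)\big)(\nabla_{\theta_k}\hat{\Psi}_{\theta_k})^{-T}$, where $Z(x, \theta_k)$ is defined in Assumption 6 and $\hat{\Psi}_{\theta_k} = n_k^{-1}\sum_{i=1}^{n_k}\psi_{\theta_k}(X_{k,i}; \hat{\theta}_k)$;
4 In a central server:
5   Collect $(\hat{\phi}_k, \hat{H}_k(\hat{\theta}_k)^{-1})$ from all the $K$ data blocks;
6   Calculate $\hat{\phi} = \big(\sum_{k=1}^{K} n_k \hat{H}_k(\hat{\theta}_k)^{-1}\big)^{-1}\sum_{k=1}^{K} n_k \hat{H}_k(\hat{\theta}_k)^{-1}\hat{\phi}_k$;
7   $\hat{\phi}^{WD} = \hat{\phi}\, I(\hat{\phi} \in \Phi) + \hat{\phi}^{SaC} I(\hat{\phi} \notin \Phi)$, where $\hat{\phi}^{SaC} = N^{-1}\sum_{k=1}^{K} n_k \hat{\phi}_k$.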
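The central-server part of Algorithm 1 (steps 5 to 7) can be illustrated with a minimal sketch. It assumes a scalar parameter of interest (p1 = 1), so each block's matrix estimate reduces to a positive number, and it assumes the parameter space Φ is a closed interval [lower, upper]. The function name `weighted_distributed` and its arguments are hypothetical and do not come from the paper.

```python
def weighted_distributed(phi_hats, h_hats, ns, lower, upper):
    """Scalar (p1 = 1) sketch of steps 5-7 of Algorithm 1.

    phi_hats : per-block estimates of phi
    h_hats   : per-block estimates H_k (assumed positive scalars here)
    ns       : per-block sample sizes n_k
    lower, upper : assumed interval standing in for the parameter space Phi
    """
    if not (len(phi_hats) == len(h_hats) == len(ns)):
        raise ValueError("all inputs must have one entry per data block")

    # Step 6: weight each block estimate by n_k * H_k^{-1}.
    weights = [n / h for n, h in zip(ns, h_hats)]
    phi_weighted = sum(w * p for w, p in zip(weights, phi_hats)) / sum(weights)

    # Split-and-conquer fallback: sample-size-weighted average of block estimates.
    total_n = sum(ns)
    phi_sac = sum(n * p for n, p in zip(ns, phi_hats)) / total_n

    # Step 7: keep the weighted estimate only if it lies in the parameter space.
    return phi_weighted if lower <= phi_weighted <= upper else phi_sac
```

For example, `weighted_distributed([1.0, 2.0], [1.0, 3.0], [100, 300], 0.0, 10.0)` gives both blocks the same weight, because the larger block also has the larger variance estimate. In the general case with p1 > 1 the divisions become matrix inverses.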
Open Source Code | No | The paper does not contain any explicit statements about making the code open source, nor does it provide links to a code repository or mention code in supplementary materials for the methodology described.
Open Datasets | Yes | The flight data are available from https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009 and the weather data are obtained from https://cds.climate.copernicus.eu/.
Dataset Splits | Yes | We segmented the full data of N = 2412782 according to the airports of departing flights and obtained 10 data segments. For each segment, we split it into data blocks of size n = 5000, while the residual data blocks were discarded, such that the total number of blocks K = 479.
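The split described above implies that 479 × 5000 = 2395000 observations were kept and 2412782 − 2395000 = 17782 were discarded. A minimal sketch of the per-segment blocking follows. The per-airport segment sizes are not reported here, so the sizes in the usage note are made up for illustration, and the function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(segment_sizes, block_size=5000):
    """Count full blocks and discarded residual rows for each segment.

    segment_sizes : number of observations in each segment
    block_size    : fixed block size n (5000 in the real data analysis)
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Each segment contributes as many full blocks as fit.
    blocks = [size // block_size for size in segment_sizes]
    # The leftover rows of each segment (fewer than block_size) are dropped.
    discarded = [size % block_size for size in segment_sizes]
    return blocks, discarded
```

With illustrative sizes, `split_into_blocks([12000, 5000, 4999])` yields two blocks, one block and zero blocks, and a segment smaller than one block is dropped entirely.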
Hardware Specification | Yes | Throughout the simulation experiments, the results of each simulation setting were based on B = 500 number of replications and were conducted in R with a 10-core Intel(R) Core(TM) i9-10900K @3.7 GHz processor.
Software Dependencies | No | Throughout the simulation experiments, the results of each simulation setting were based on B = 500 number of replications and were conducted in R with a 10-core Intel(R) Core(TM) i9-10900K @3.7 GHz processor. This only specifies 'R' without a version number or any other software packages with versions.
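Experiment Setup | Yes | For each of $K$ data blocks with $K \in \{10, 50, 100, 250, 500, 1000, 2000\}$, $\{(X_{k,i}; Y_{k,i})\}_{i=1}^{n} \subset \mathbb{R}^{p} \times \{0, 1\}$ were independently sampled from the following model: $X_{k,i} \sim N(0_{p \times 1}, 0.75^{2} I_{p \times p})$ and $P(Y_{k,i} = 1 \mid X_{k,i}) = \exp(X_{k,i}^{T}\theta_k^{*}) / \{1 + \exp(X_{k,i}^{T}\theta_k^{*})\}$, where $\theta_k^{*} = (\phi^{*T}, \lambda_k^{*T})^{T}$, $\phi^{*} = 1$, $\lambda_k^{*} = (\lambda_{k,1}^{*}, \lambda_{k,2}^{*}, \ldots, \lambda_{k,p_2}^{*})^{T}$ and $\lambda_{k,j}^{*} = (-1)^{j}\, 10\, (1 - 2(k - 1)/(K - 1))$. The sample sizes of the data blocks were equal at $n = N K^{-1}$ with $N = 2 \times 10^{6}$. Two levels of the dimension $p_2 = 4$ and $10$ of the nuisance parameter $\lambda_k$ were considered.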
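The data-generating model in this setup can be sketched as follows. This is an illustration only, not the authors' code, which was written in R. It assumes a scalar parameter of interest equal to 1 (so each covariate vector has 1 + p2 entries) and reads the heterogeneous nuisance coefficient for block k, coordinate j as (-1)^j · 10 · (1 − 2(k − 1)/(K − 1)). The function names are hypothetical.

```python
import math
import random


def lambda_star(k, j, K):
    """Nuisance coefficient for block k (1-based), coordinate j (1-based); needs K > 1."""
    return (-1) ** j * 10 * (1 - 2 * (k - 1) / (K - 1))


def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def simulate_block(k, K, n, p2, rng):
    """Draw n pairs (x, y) for block k from the logistic model sketched above."""
    # theta = (phi, lambda_k) with phi = 1 (assumed scalar, so p = 1 + p2).
    theta = [1.0] + [lambda_star(k, j, K) for j in range(1, p2 + 1)]
    data = []
    for _ in range(n):
        # Covariates are independent N(0, 0.75^2) draws.
        x = [rng.gauss(0.0, 0.75) for _ in theta]
        eta = sum(xi * ti for xi, ti in zip(x, theta))
        y = 1 if rng.random() < sigmoid(eta) else 0
        data.append((x, y))
    return data
```

A call such as `simulate_block(1, 10, 100, 4, random.Random(0))` draws a small block for quick checks. The setup itself uses n = N/K, for instance 200000 observations per block when K = 10.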