Distributed High-dimensional Regression Under a Quantile Loss Function
Authors: Xi Chen, Weidong Liu, Xiaojun Mao, Zhuoyi Yang
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The simulation analysis is provided to demonstrate the effectiveness of our method. Keywords: Distributed estimation, high-dimensional linear model, quantile loss, robust estimator, support recovery |
| Researcher Affiliation | Academia | Xi Chen EMAIL Stern School of Business, New York University, New York, NY 10012, USA; Weidong Liu EMAIL School of Mathematical Sciences and MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, 200240, China; Xiaojun Mao EMAIL School of Data Science, Fudan University, Shanghai, 200433, China; Zhuoyi Yang EMAIL Stern School of Business, New York University, New York, NY 10012, USA |
| Pseudocode | Yes | Algorithm 1 Distributed high-dimensional QR estimator |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the code developed for the methodology described. |
| Open Datasets | No | We consider the following linear model $Y_i = X_i^\top \beta + e_i$, $i = 1, 2, \ldots, n$, where $X_i = (1, X_{i,1}, \ldots, X_{i,p})^\top$ is a $(p+1)$-dimensional covariate vector and the $(X_{i,1}, \ldots, X_{i,p})$ are drawn i.i.d. from a multivariate normal distribution $N(0, \Sigma)$. The paper uses synthetic data generated according to this model and does not provide access information for any public datasets. |
| Dataset Splits | No | The paper describes generating synthetic data and varying parameters like sample size (n) and local sample size (m) for simulations. While it discusses data distribution across 'L' machines, this relates to the distributed computing setup, not explicit training/validation/test splits of a specific dataset for reproducibility. For example, it states: "We fix the sample size n = 10000, local sample size m = 500, the sparsity level s = 20 and dimension p = 500." |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It only contains a general statement in the introduction: "For example, a personal computer usually has a limited memory size in GBs". |
| Software Dependencies | No | The paper mentions: "In our experiments, we adopt the PSSgb optimization method for solving (18)." and "To solve the ℓ1-regularized QR estimator, we formulate it into a standard linear programming problem (LP) and solve it by Gurobi (Gurobi Optimization, 2020), which is the state-of-the-art LP solver." While Gurobi is mentioned with a year, a specific version number is not provided, and PSSgb lacks any version information. Therefore, a fully reproducible description of ancillary software with specific version numbers is not present for all key components. |
| Experiment Setup | Yes | We fix the sample size n = 10000, local sample size m = 500, the sparsity level s = 20 and dimension p = 500. We plot the ℓ2-error from the true QR coefficients versus the number of iterations. Since the Avg-DC only requires one-shot communication, we use a horizontal line to show its performance. The results are shown in Figure 1. From the result, both pooled REL and distributed REL outperform the Avg-DC algorithm and become stable after a few iterations. Therefore, for the rest of the experiments in this section, we use 50 as the number of iterations in the algorithm. Moreover, the distributed REL almost matches the performance of pooled REL for all three noises. |
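The data-generating model quoted above ($Y_i = X_i^\top \beta + e_i$ with Gaussian covariates and an intercept) is concrete enough to sketch. The snippet below is a minimal illustration, not the paper's exact simulation code: the identity covariance, the $s$-sparse coefficient pattern, and the $t$-distributed noise are illustrative assumptions.

```python
import numpy as np

def generate_data(n=200, p=50, s=5, seed=None):
    """Draw (X, y) from Y_i = X_i' beta + e_i with N(0, I_p) covariates.

    Identity Sigma, unit active coefficients, and t(3) noise are
    illustrative choices; the paper varies these settings.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(p + 1)
    beta[: s + 1] = 1.0                   # intercept plus s active coefficients
    Z = rng.standard_normal((n, p))       # covariates ~ N(0, I_p)
    X = np.hstack([np.ones((n, 1)), Z])   # prepend the intercept column
    e = rng.standard_t(df=3, size=n)      # heavy-tailed noise, illustrative
    y = X @ beta + e
    return X, y, beta
```

Swapping in a non-identity `Sigma` would only require replacing `standard_normal` with `rng.multivariate_normal`.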
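The Software Dependencies row notes that the paper casts the ℓ1-regularized quantile regression (QR) estimator as a linear program and solves it with Gurobi. The reformulation itself is standard and can be sketched with scipy's open-source HiGHS solver instead; this is a generic ℓ1-QR LP, not the authors' implementation, and the choice of `tau` and `lam` below is arbitrary. Splitting the residuals into $u_i - v_i$ and the coefficients into $\beta^+ - \beta^-$ makes the check loss and the ℓ1 penalty linear.

```python
import numpy as np
from scipy.optimize import linprog

def l1_quantile_regression(X, y, tau=0.5, lam=0.1):
    """Solve min_beta sum_i rho_tau(y_i - x_i' beta) + lam * ||beta||_1 as an LP.

    Decision vector z = [beta_plus, beta_minus, u, v], all nonnegative, with
    X beta_plus - X beta_minus + u - v = y and objective
    lam * 1'(beta_plus + beta_minus) + tau * 1'u + (1 - tau) * 1'v.
    """
    n, p = X.shape
    c = np.concatenate([lam * np.ones(2 * p),
                        tau * np.ones(n),
                        (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    beta = res.x[:p] - res.x[p:2 * p]
    return beta, res
```

For the dimensions used in the paper ($p = 500$, $n = 10000$) a commercial solver such as Gurobi with sparse constraint matrices would be the practical choice; the dense `A_eq` above is only suitable for small examples.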