reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Linear Regression With Unmatched Data: A Deconvolution Perspective

Authors: Mona Azadkia, Fadoua Balabdaoui

JMLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Several applications with synthetic and real data sets are considered to illustrate the theory.
Researcher Affiliation	Academia	Mona Azadkia, Department of Statistics, London School of Economics and Political Science, London, United Kingdom; Fadoua Balabdaoui, Department of Mathematics, ETH Zürich, Zürich, Switzerland
Pseudocode	No	The paper describes mathematical derivations and methodological steps in prose but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper mentions using specific R functions and packages (e.g., 'optim from the package stats of the open software R', 'function density from R package stats with hyper-parameter SJ') but does not state that the authors are releasing their own code for the methodology developed in the paper.
Open Datasets	Yes	We apply our method to data from 1850 to 1930 decennial censuses of the United States studied in Olivetti and Paserman (2015); D'Haultfoeuille et al. (2022) using the 1 percent IPUMS samples (Ruggles et al., 2010). We consider the Power Plant data set from UCI Machine Learning Repository. [Footnote 1: https://archive.ics.uci.edu/]
Dataset Splits	No	The paper describes how samples were generated or sub-sampled for experiments (e.g., '1000 independent samples... of size n = 4000', 'select a subset of size 4000', 'select a sub-sample of matched data of size m = 30'), but it does not provide specific training/validation/test splits, split percentages, or cross-validation strategies needed for reproducibility.
Hardware Specification	No	The paper does not mention any specific hardware (e.g., CPU, GPU models, memory, or cloud computing instances) used for running the experiments.
Software Dependencies	No	The paper mentions using 'the function optim from the package stats of the open software R' and 'function density from R package stats with hyper-parameter SJ'. While 'R' is a programming environment and 'stats' is a package, specific version numbers for R or the 'stats' package are not provided.
Experiment Setup	Yes	We use the default setting of the function optim from the package stats of the open software R. The default method of optimization is the method introduced by Nelder and Mead (1965). We consider two different families of centred distributions, Normal and Laplace. We consider different possible values for their scale parameters so that the standard deviation (sd) of the noise varies in the set {0.1, 0.2, ..., 1}. For the DLSE estimator β̂_n, we need an estimate of the noise distribution, and for this, we use a kernel density estimator based on the residuals of OLS β_m obtained using the matched data. We use a Gaussian kernel and select the bandwidth according to Sheather and Jones (1991).
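
The noise design quoted in the Experiment Setup row can be illustrated with a short sketch. This is not the authors' code: the paper reports using R (optim and density from the stats package), while the snippet below is a standard-library Python illustration. The names `SD_GRID`, `laplace_scale_for_sd`, `sample_noise`, and `gaussian_kde` are hypothetical. It relies on the fact that a Laplace distribution with scale b has variance 2b², so a target standard deviation sd corresponds to b = sd/√2. The kernel density estimate takes its bandwidth as an argument; Sheather-Jones bandwidth selection, which the paper uses, is not implemented here.

```python
import math
import random

# Target noise standard deviations {0.1, 0.2, ..., 1}, as quoted in the setup.
SD_GRID = [k / 10 for k in range(1, 11)]


def laplace_scale_for_sd(sd):
    """Laplace scale b giving standard deviation sd (variance is 2 * b**2)."""
    return sd / math.sqrt(2.0)


def sample_noise(family, sd, n, rng):
    """Draw n centred noise values with standard deviation sd.

    family is "normal" or "laplace" (hypothetical labels for this sketch).
    """
    if family == "normal":
        return [rng.gauss(0.0, sd) for _ in range(n)]
    if family == "laplace":
        b = laplace_scale_for_sd(sd)
        # The difference of two independent exponentials with mean b
        # follows a Laplace(0, b) distribution.
        rate = 1.0 / b
        return [rng.expovariate(rate) - rng.expovariate(rate) for _ in range(n)]
    raise ValueError(f"unknown noise family: {family!r}")


def gaussian_kde(points, bandwidth):
    """Return a Gaussian-kernel density estimate built from points.

    The bandwidth is supplied by the caller; this sketch does not
    implement the Sheather-Jones selector.
    """
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density
```

In the setting described above, `points` would be the OLS residuals computed from the matched sub-sample, and the resulting density would stand in for the unknown noise distribution. For example, `sample_noise("laplace", 0.5, 4000, random.Random(0))` draws one noise vector of the sample size quoted in the Dataset Splits row.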