A Diagnostic Tool for Out-of-Sample Model Evaluation
Authors: Ludvig Hult, Dave Zachariah, Petre Stoica
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyperparameter tuning. |
| Researcher Affiliation | Academia | Ludvig Hult EMAIL, Department of Information Technology, Uppsala University; Dave Zachariah EMAIL, Department of Information Technology, Uppsala University; Petre Stoica EMAIL, Department of Information Technology, Uppsala University |
| Pseudocode | No | The paper does not contain any sections explicitly labeled "Pseudocode" or "Algorithm", nor are there structured code-like blocks describing procedures. |
| Open Source Code | Yes | Code to reproduce all experiments can be found at https://github.com/el-hult/lal. |
| Open Datasets | Yes | The data set consists of California housing prices from the 1990 census (Kelley Pace & Barry, 1997)... We use the Palmer Penguin data set, popularized by Horst et al. (2020)... We use the UCI Airfoil data set (Dua & Graff, 2017)... The data used is the monthly number of earthquakes worldwide with magnitude ≥ 5 between 2012 and 2022 (USGS, 2022)... The data set is the MNIST handwritten digits (Bottou et al., 1994). |
| Dataset Splits | Yes | The training data set D0 has n0 = 15 000 sampled without replacement. The calibration data set D has n = 150 and is sampled without replacement from the remaining data. (Section 4.1) The training data set D0 was created with n0 = 100... We analyze the case of out-of-sample batch size m = 1. The performance of the model is evaluated using the absolute error loss: ... We use a calibration data set D1 for which n = 30. In the second case... we use a calibration data set D2 also having n = 30. (Section 4.2) We use a training data set D0 of n0 = 150, leaving 183 samples for calibration. Using D0, we fit f(x) via multinomial logistic regression using L2-regularization and cross-validation. The model output is a three-dimensional vector f(x) = [f1(x), f2(x), f3(x)]T approximating the conditional probabilities, so that fi(x) approximates P[Y = i|X = x]. ... Two different calibration data sets D1 and D2 of sample size n = 50 were constructed. (Section 4.3) The calibration data D is constructed by weighted sampling of n = 100 samples. The probability to draw a data point (xi, yi) is proportional to exp(− 1/1000 xi), making data points with high frequency and small displacement more likely to sample, similar to the distribution shift experiments in Tibshirani et al. (2020, sec. 2.2). The remaining 1 403 data points constitute the training data set D0. (Section 4.4) There are 120 data points z1, . . . , z120, with zi ∈ {0, 1, 2, . . .}. These are randomly split into 100 and 20 data points, forming D0 and D. (Section 4.5) The training data set D0 has n0 = 6 · 10^4 data points, and the calibration data set D has n = 10^4 data points. (Section 4.6) |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU or CPU models, memory sizes) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like "random Fourier feature basis", "L2-regularized least squares regression", "multinomial logistic regression", "ReLU activations", "softmax output", and "Adam optimizer". However, no specific version numbers for these software packages or the programming language used are provided. |
| Experiment Setup | Yes | The hyperparameters (number of basis functions K, bandwidth b and regularization strength λ) are tuned by five-fold cross-validation on the training data. The loss function is the absolute error expressed in dollars ℓ(x, y) = |y − f(x)|. (Section 4.1) The optimization was run for 100 epochs, employing a batch size of 1024 and learning rate 0.01. (Section 4.6) |
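The Section 4.4 split quoted above (calibration points drawn with probability proportional to exp(−x/1000), the remaining 1 403 points forming D0) can be sketched as follows. This is a minimal illustration with synthetic stand-in data, not the authors' code; the array sizes mirror the 1 503-row UCI Airfoil setting, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in covariate and response (the paper uses the
# UCI Airfoil data; 1503 rows, of which 100 become calibration data).
x = rng.uniform(0, 3000, size=1503)
y = rng.normal(size=1503)

# Sampling weights proportional to exp(-x / 1000), as in Section 4.4:
# points with small x are more likely to enter the calibration set,
# inducing a covariate shift between D0 and D.
w = np.exp(-x / 1000.0)
p = w / w.sum()

n = 100  # calibration set size
calib_idx = rng.choice(len(x), size=n, replace=False, p=p)
train_mask = np.ones(len(x), dtype=bool)
train_mask[calib_idx] = False

D0_x, D0_y = x[train_mask], y[train_mask]   # training data, 1403 points
D_x, D_y = x[calib_idx], y[calib_idx]       # shifted calibration data
```

Because the weights decay in x, the calibration sample is concentrated at small x, which is exactly the covariate shift the diagnostic is meant to detect.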
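The Section 4.1 setup quoted above (random Fourier features with K basis functions, bandwidth b, and regularization strength λ tuned by five-fold cross-validation, evaluated with absolute-error loss) can be sketched with standard scikit-learn components. This is our reconstruction under stated assumptions, not the authors' code: `RBFSampler` stands in for the random Fourier feature basis, its `gamma` plays the role of the bandwidth, the grid values are illustrative, and the data is synthetic.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))   # synthetic stand-in features
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

# Random Fourier features followed by L2-regularized least squares.
model = Pipeline([
    ("rff", RBFSampler(random_state=0)),
    ("ridge", Ridge()),
])

# Five-fold CV over K (n_components), bandwidth (gamma), and λ (alpha),
# scored with the absolute-error loss ℓ(x, y) = |y − f(x)|.
grid = GridSearchCV(
    model,
    param_grid={
        "rff__n_components": [50, 100],   # K
        "rff__gamma": [0.1, 1.0],         # bandwidth
        "ridge__alpha": [0.01, 1.0],      # λ
    },
    cv=5,
    scoring="neg_mean_absolute_error",
)
grid.fit(X, y)

f = grid.best_estimator_
abs_err = np.abs(y - f.predict(X))  # per-point losses fed to the diagnostic
```

The per-point absolute errors are what the paper's diagnostic consumes; here they are computed in-sample only to keep the sketch short, whereas the paper evaluates them on held-out calibration data.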