Sufficient reductions in regression with mixed predictors
Authors: Efstathia Bura, Liliana Forzani, Rodrigo Garcia Arancibia, Pamela Llop, Diego Tomassi
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the performance of the proposed method and compare it with other approaches through simulations and real data examples. Section 5 contains an extensive simulation study that demonstrates the competitive performance of our approach. Furthermore, we show the superior performance of our methods as compared with generalized linear models and a version of principal component regression that allows for mixed predictors in the analysis of three data sets in Section 6. |
| Researcher Affiliation | Academia | Efstathia Bura EMAIL Institute of Statistics and Mathematical Methods in Economics, Faculty of Mathematics and Geoinformation, TU Wien, Vienna, 1040, Austria; Liliana Forzani EMAIL Facultad de Ingeniería Química, Universidad Nacional del Litoral, Researcher of CONICET, Santa Fe, Argentina; Rodrigo García Arancibia EMAIL Instituto de Economía Aplicada Litoral-FCE-UNL, Universidad Nacional del Litoral, Researcher of CONICET, Santa Fe, Argentina; Pamela Llop EMAIL Facultad de Ingeniería Química, Universidad Nacional del Litoral, Researcher of CONICET, Santa Fe, Argentina; Diego Tomassi EMAIL Facultad de Ingeniería Química, Universidad Nacional del Litoral, Researcher of CONICET, Santa Fe, Argentina |
| Pseudocode | No | The paper describes its methodology using mathematical formulations and prose. It does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | The R code we used in both simulations and real data analyses in Section 6 can be found at https://github.com/lforzani/SDR_mixed_predictions. |
| Open Datasets | Yes | Krzanowski (1975) studied the problem of discriminating between two groups... We analyze four of the five data sets in Krzanowski's paper... Governance Indicators and per capita GDP data can be downloaded from Worldwide Governance Indicators and The World Bank Data, respectively. |
| Dataset Splits | Yes | The prediction error is computed as $\|P_{\alpha^T(X_N,H_N)} - P_{\widehat{\alpha}^T(X_N,H_N)}\|^2$, where $(X_N, H_N)$ is a new sample of size N = 2000 that is independent of the training sample. In Table 4 we report the leave-one-out misclassification rate... The average of the leave-one-out mean square prediction errors of the linear and kernel regression models in Table 5 provides an unbiased measure of predictive performance. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It focuses on the methodology and results without specifying details such as GPU models, CPU types, or other computational resources. |
| Software Dependencies | No | The R code we used in both simulations and real data analyses in Section 6 can be found at https://github.com/lforzani/SDR_mixed_predictions. Using the np R package, the value of the nonparametric version of R2 is 0.32 for the PFC-based CG index, which is much lower than 0.54, the value for the PCA-based index. |
| Experiment Setup | Yes | In all our simulations, the response is generated from the uniform distribution on the integers {1, . . . , r + 1}, with r = 5, and we set $f_{yj} = I(y = j) - n_j/n$, where I is the indicator function, n denotes the total sample size and $n_j$ the number of observations in category j, for j = 1, . . . , r. All reported results are based on sample sizes n = 100, 200, 300, 500, 750, and 100 repetitions. Selection of the hyperparameters (λ, γ) in (39) is carried out via 10-fold cross validation and minimizing the prediction error as optimization criterion. The procedure starts by estimating an upper bound λm so that the whole estimate vanishes for any λ > λm. We then set a grid of nλ candidate values for λ, uniformly spaced on a logarithmic scale between 0 and λm. Here, we set nλ = 100. For γ, we consider 11 values uniformly spaced in [0, 1]. |
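The hyperparameter search quoted above (a log-spaced λ grid below an upper bound λm, an 11-point γ grid on [0, 1], and 10-fold cross validation minimizing prediction error) can be sketched as follows. This is a minimal illustration, not the authors' R code: the `fit` and `predict_fn` callables stand in for the paper's penalized estimator, and the small positive floor on the λ grid is an assumption, since a log scale cannot reach 0 exactly.

```python
import numpy as np

def cv_select(X, y, fit, predict_fn, lam_max,
              n_lam=100, n_gamma=11, n_folds=10, seed=0):
    """Pick (lambda, gamma) by k-fold CV, minimizing mean squared prediction error.

    fit(X, y, lam, gamma) -> model and predict_fn(model, X) -> predictions are
    placeholders for the estimator; lam_max is the bound above which the whole
    estimate vanishes, as in the paper's procedure.
    """
    # Log-spaced grid up to lam_max; a strictly positive floor (assumed here)
    # stands in for the paper's "between 0 and lambda_m".
    lams = np.logspace(np.log10(lam_max * 1e-4), np.log10(lam_max), n_lam)
    gammas = np.linspace(0.0, 1.0, n_gamma)  # 11 values uniform on [0, 1]

    # Assign each observation to one of n_folds folds at random.
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds

    best, best_err = None, np.inf
    for lam in lams:
        for gamma in gammas:
            errs = []
            for k in range(n_folds):
                tr, te = folds != k, folds == k
                model = fit(X[tr], y[tr], lam, gamma)
                errs.append(np.mean((predict_fn(model, X[te]) - y[te]) ** 2))
            err = np.mean(errs)
            if err < best_err:  # keep the pair with smallest CV error
                best, best_err = (lam, gamma), err
    return best
```

With n_lam = 100 and n_gamma = 11 as in the paper, this evaluates 1,100 candidate pairs, each requiring 10 model fits.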