reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Asymptotics of K-Fold Cross Validation

Authors: Jessie Li

JAIR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Monte Carlo simulations demonstrate the asymptotic validity of our confidence intervals for the expected out-of-sample error and investigate the size and power properties of our test. In our empirical application, we use our estimator selection test to compare the out-of-sample predictive performance of OLS, Neural Networks, and Random Forests for predicting the sale price of a domain name in a Go Daddy expiry auction.
Researcher Affiliation	Academia	Jessie Li EMAIL Department of Economics University of California, Santa Cruz 1156 High Street, Santa Cruz, CA 95064, USA
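Pseudocode	Yes	1. For r = 1...R replications, draw I different subsamples of sizes $b_i$ for i = 1...I, where $b_i \to \infty$ and $b_i/n \to 0$, and compute the statistic indexed by $(b_i, r)$: $\left(\hat{L}^{CV}_{b,i}(M_2) - \hat{L}^{CV}_{b,i}(M_1)\right) - \left(\hat{L}^{CV}_{n}(M_2) - \hat{L}^{CV}_{n}(M_1)\right)$. 2. For each i = 1...I and j = 1...J, define $c_{\tau ij}$ and $c_{\rho ij}$ as the $\tau_j$th and $\rho_j$th percentiles of the step-1 statistics for subsample size $b_i$, where $\tau_j$ and $\rho_j$ are the jth elements of $\tau = \{70, 70 + \frac{20}{J-1}, 70 + \frac{40}{J-1}, \ldots, 90\}$ and $\rho = \{10, 10 + \frac{20}{J-1}, 10 + \frac{40}{J-1}, \ldots, 30\}$. 3. Define $y_{ij} = \log(c_{\tau ij} - c_{\rho ij})$, $\bar{y}_i = \frac{1}{J}\sum_{j=1}^{J} y_{ij}$, $\bar{y} = \frac{1}{I}\sum_{i=1}^{I} \bar{y}_i$, $\overline{\log(b)} = \frac{1}{I}\sum_{i=1}^{I} \log(b_i)$. 4. The estimated rate of convergence is $n^{\hat{\delta}}$ for $\hat{\delta} = -\frac{\sum_{i=1}^{I} (\bar{y}_i - \bar{y})\left(\log(b_i) - \overline{\log(b)}\right)}{\sum_{i=1}^{I} \left(\log(b_i) - \overline{\log(b)}\right)^2}$.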
Open Source Code	No	The paper mentions using 'grf R package' and 'deepnet R package' for some computations, but does not provide any statement or link for the source code of the methodology described in this paper by the authors themselves.
Open Datasets	No	The data come from Go Daddy, a domain name registrar responsible for managing sales of internet domain names. Each observation is a particular domain name listed on a Go Daddy expiry auction between May 12th, 2017 and July 11th, 2017. ... The paper describes the source of the data but does not provide any concrete access information (link, DOI, repository, or citation) for this dataset.
Dataset Splits	Yes	For each Monte Carlo simulation r = 1...R, we generate the test data $(x_{ri}, z_{ri}, y_{ri})_{i=1}^{n_{Test}}$ and the training data $(x_{ri}, z_{ri}, y_{ri})_{i=1}^{n}$ independently of each other... The 5-fold cross validation error using squared error loss is... We perform three pairwise nominal (5/3)%-level 5-fold cross validation with squared error loss estimator selection tests...
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the Monte Carlo simulations or the empirical application.
Software Dependencies	No	The second estimator is an honest Random Forest estimator using all independent variables and computed using the grf R package's regression_forest command with the default options. The third estimator is a single-hidden layer Neural Network with a sigmoidal activation function and 5 hidden units using all independent variables and computed using the nn.train command in the deepnet R package. The paper mentions specific R packages but does not provide their version numbers, nor the version of R itself.
Experiment Setup	Yes	We consider three different values for $\gamma_0$ while keeping $\beta_0$ at 0.5. ...using a Gaussian kernel $K_{h_n}(x) = K(x/h_n)$, $K(x) = (2\pi)^{-1/2} e^{-x^2/2}$, and bandwidth $h_n = (4/3)^{1/5} n^{-1/5}$. ...a single-hidden layer Neural Network with a sigmoidal activation function and 5 hidden units... We examine the empirical frequencies of failing to reject the null... under six different choices of $\beta_0 \in \{0, 1/n, 1/\sqrt{n}, n^{-1/4}, n^{-1/6}, 1\}$. ...using n = 5000 observations, R = 5000 Monte Carlo simulations, and B = 5000 subsampling replications with a subsample size of $b = \sqrt{n}$. ...I = 10 different values of the subsample size $n^{(0.5:0.05:0.95)}$. ...The results are the same across a range of different values of the subsample size $b = n^{\kappa}$, where $\kappa \in \{0.4, 0.5, 0.6, 0.7, 0.8\}$.
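
Steps 2 to 4 of the rate-of-convergence procedure quoted in the Pseudocode row can be sketched in Python as below. This is an illustration under stated assumptions, not the author's code: the function name `estimate_rate_exponent` and all variable names are hypothetical, the subsample statistics from step 1 are taken as given inputs, and the leading minus sign in the slope formula is an assumption based on the standard subsampling rate estimator, since the extracted text lost its signs. The synthetic input, a fixed set of normal draws scaled by b^(-1/2), is made up for the demonstration and is not data from the paper.

```python
import numpy as np


def estimate_rate_exponent(stats_by_size, sizes, J=5):
    """Estimate delta so that the rate of convergence is n**delta.

    stats_by_size[i] holds the R subsample statistics (step 1)
    computed at subsample size sizes[i]. Names are hypothetical.
    """
    # Step 2: percentile grids tau = {70, ..., 90} and rho = {10, ..., 30}
    tau = np.linspace(70.0, 90.0, J)
    rho = np.linspace(10.0, 30.0, J)

    # Step 3: average log quantile spread for each subsample size
    y_bar = np.array([
        np.mean(np.log(np.percentile(s, tau) - np.percentile(s, rho)))
        for s in stats_by_size
    ])
    log_b = np.log(np.asarray(sizes, dtype=float))

    # Step 4: least-squares slope of y_bar on log(b).
    # ASSUMPTION: delta_hat is minus the slope, because the spread of an
    # unnormalized statistic shrinks like b**(-delta).
    x = log_b - log_b.mean()
    slope = np.sum((y_bar - y_bar.mean()) * x) / np.sum(x ** 2)
    return -slope


# Synthetic check: statistics whose spread shrinks like b**(-1/2)
rng = np.random.default_rng(0)
n = 5000
# I = 10 subsample sizes n**0.5, n**0.55, ..., n**0.95
sizes = np.round(n ** np.linspace(0.5, 0.95, 10)).astype(int)
z = rng.standard_normal(2000)
stats = [z / np.sqrt(b) for b in sizes]

delta_hat = estimate_rate_exponent(stats, sizes)
print(delta_hat)
```

Because every synthetic statistic is the same draw rescaled by b^(-1/2), the log quantile spreads are exactly linear in log(b) with slope -1/2, so the estimate should come out at 0.5 up to floating-point error. With real subsample statistics the fit would be noisy, and averaging over J percentile pairs is what smooths it.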