Asymptotics of K-Fold Cross Validation

Authors: Jessie Li

JAIR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Monte Carlo simulations demonstrate the asymptotic validity of our confidence intervals for the expected out-of-sample error and investigate the size and power properties of our test. In our empirical application, we use our estimator selection test to compare the out-of-sample predictive performance of OLS, Neural Networks, and Random Forests for predicting the sale price of a domain name in a Go Daddy expiry auction.
Researcher Affiliation | Academia | Jessie Li (EMAIL), Department of Economics, University of California, Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA
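Pseudocode | Yes | 1. For $r = 1, \ldots, R$ replications, draw $I$ different subsamples of sizes $b_i$ for $i = 1, \ldots, I$, where $b_i \to \infty$ and $b_i/n \to 0$, and compute $\Delta_{b_i,r} = \hat{L}^{CV}_{b,i}(M_2) - \hat{L}^{CV}_{b,i}(M_1) - \left(\hat{L}^{CV}_{n}(M_2) - \hat{L}^{CV}_{n}(M_1)\right)$. 2. For each $i = 1, \ldots, I$ and $j = 1, \ldots, J$, define $c_{\tau_{ij}}$ and $c_{\rho_{ij}}$ as the $\tau_j$th and $\rho_j$th percentiles of $\Delta_{b_i,r}$ for $i = 1, \ldots, I$, where $\tau_j$ and $\rho_j$ are the $j$th elements of $\tau = \left\{70, 70 + \frac{20}{J-1}, 70 + \frac{40}{J-1}, \ldots, 90\right\}$ and $\rho = \left\{10, 10 + \frac{20}{J-1}, 10 + \frac{40}{J-1}, \ldots, 30\right\}$. 3. Define $y_{ij} = \log\left(c_{\tau_{ij}} - c_{\rho_{ij}}\right)$, $\bar{y}_i = \frac{1}{J}\sum_{j=1}^{J} y_{ij}$, $\bar{y} = \frac{1}{I}\sum_{i=1}^{I} \bar{y}_i$, $\overline{\log(b)} = \frac{1}{I}\sum_{i=1}^{I} \log(b_i)$. 4. The estimated rate of convergence is $n^{\hat{\delta}}$ for $\hat{\delta} = -\frac{\sum_{i=1}^{I} (\bar{y}_i - \bar{y})\left(\log(b_i) - \overline{\log(b)}\right)}{\sum_{i=1}^{I} \left(\log(b_i) - \overline{\log(b)}\right)^2}$.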
Open Source Code | No | The paper mentions using 'grf R package' and 'deepnet R package' for some computations, but does not provide any statement or link for the source code of the methodology described in this paper by the authors themselves.
Open Datasets | No | The data come from Go Daddy, a domain name registrar responsible for managing sales of internet domain names. Each observation is a particular domain name listed on a Go Daddy expiry auction between May 12th, 2017 and July 11th, 2017. ... The paper describes the source of the data but does not provide any concrete access information (link, DOI, repository, or citation) for this dataset.
Dataset Splits | Yes | For each Monte Carlo simulation $r = 1, \ldots, R$, we generate the test data $(x_{ri}, z_{ri}, y_{ri})_{i=1}^{n_{Test}}$ and the training data $(x_{ri}, z_{ri}, y_{ri})_{i=1}^{n}$ independently of each other... The 5-fold cross validation error using squared error loss is... We perform three pairwise nominal (5/3)%-level 5-fold cross validation with squared error loss estimator selection tests...
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the Monte Carlo simulations or the empirical application.
Software Dependencies | No | The second estimator is an honest Random Forest estimator using all independent variables and computed using the grf R package's regression_forest command with the default options. The third estimator is a single-hidden layer Neural Network with a sigmoidal activation function and 5 hidden units using all independent variables and computed using the nn.train command in the deepnet R package. The paper mentions specific R packages but does not provide their version numbers, nor the version of R itself.
Experiment Setup | Yes | We consider three different values for $\gamma_0$ while keeping $\beta_0$ at 0.5. ...using a Gaussian kernel $K_{h_n}(x) = K(x/h_n)$, $K(x) = (2\pi)^{-1/2} e^{-x^2/2}$, and bandwidth $h_n = (4/3)^{1/5} n^{-1/5}$. ...a single-hidden layer Neural Network with a sigmoidal activation function and 5 hidden units... We examine the empirical frequencies of failing to reject the null... under six different choices of $\beta_0 \in \left\{0, \frac{1}{n}, \frac{1}{\sqrt{n}}, n^{-1/4}, n^{-1/6}, 1\right\}$. ...using $n = 5000$ observations, $R = 5000$ Monte Carlo simulations, and $B = 5000$ subsampling replications with a subsample size of $b = \sqrt{n}$. ...$I = 10$ different values of the subsample size $n^{(0.5:0.05:0.95)}$. ...The results are the same across a range of different values of the subsample size $b = n^{\kappa}$, where $\kappa \in \{0.4, 0.5, 0.6, 0.7, 0.8\}$.
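
The four steps quoted in the Pseudocode row can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function name `estimate_rate`, its parameters, and the use of NumPy are assumptions, and the sample mean stands in for the difference in cross-validation errors between two estimators purely so that the example has a known rate ($\delta = 1/2$). The minus sign in the slope formula is assumed from the fact that the spread of the un-normalized statistic shrinks like $b^{-\delta}$.

```python
import numpy as np


def estimate_rate(data, stat_fn, subsample_sizes, R=500, J=5, seed=0):
    """Estimate delta in the rate n^delta from subsample quantile spreads.

    stat_fn is a hypothetical stand-in for the statistic of interest
    (in the paper, a difference of cross-validation errors).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    full_stat = stat_fn(data)

    # Step 2: percentile grids tau = {70, ..., 90} and rho = {10, ..., 30}.
    taus = np.linspace(70, 90, J)
    rhos = np.linspace(10, 30, J)

    y_bar = []
    for b in subsample_sizes:
        # Step 1: R subsamples of size b, drawn without replacement,
        # each recentered at the full-sample statistic.
        deltas = np.empty(R)
        for r in range(R):
            idx = rng.choice(n, size=b, replace=False)
            deltas[r] = stat_fn(data[idx]) - full_stat

        # Step 3: log of upper-minus-lower percentile differences, averaged over j.
        y_ij = np.log(np.percentile(deltas, taus) - np.percentile(deltas, rhos))
        y_bar.append(y_ij.mean())

    y_bar = np.asarray(y_bar)
    log_b = np.log(np.asarray(subsample_sizes, dtype=float))
    x = log_b - log_b.mean()

    # Step 4: negative least-squares slope of y_bar on log(b).
    return -np.sum((y_bar - y_bar.mean()) * x) / np.sum(x ** 2)


# Toy usage: i.i.d. normal data with the sample mean as the statistic.
data = np.random.default_rng(1).normal(size=10_000)
sizes = [int(round(len(data) ** k)) for k in np.linspace(0.3, 0.6, 7)]
delta_hat = estimate_rate(data, np.mean, sizes)
print(f"estimated delta: {delta_hat:.3f}")
```

For the sample mean, the estimate should land close to 0.5. The subsample exponents here (0.3 to 0.6) are chosen for speed in the toy example and differ from the grid reported in the Experiment Setup row.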