Model Selection via the VC Dimension
Authors: Merlin Mpoudeu, Bertrand Clarke
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our estimator is consistent. Then, we verify it performs well compared to seven other model selection techniques. We do this for a variety of types of data sets. Keywords: Vapnik-Chervonenkis dimension, model selection, Bayesian information criterion, sparsity methods, empirical risk minimization, multi-type data. [...] In Sec. 4 we present our studies using simulated, benchmark, and real data. We compare our method for model selection to AIC, BIC, CV, $\widehat{ERM}_1$, and $\widehat{ERM}_2$. |
| Researcher Affiliation | Collaboration | Merlin Mpoudeu EMAIL Bank of America Atlanta, GA, USA Bertrand Clarke EMAIL Department of Statistics University of Nebraska-Lincoln Lincoln, NE 68503, USA |
| Pseudocode | Yes | Our algorithm is as follows. Algorithm #1: A collection of regression models G = {g_β}; a data set; two integers b_1 and b_2 for the number of bootstrap samples; an integer m for the number of subintervals to discretize the losses; a set of design points N_L = {n_1, n_2, ..., n_L}. |
| Open Source Code | Yes | Data, code, and results for the full versions of the analyses presented here can be found at https://github.com/poudas1981/Wheat_data_set. |
| Open Datasets | Yes | We re-analyze the Wheat data set presented and studied in Campbell et al. (2003), Dilbirligi et al. (2006), and Dhungana et al. (2007) from a non-complexity based standpoint. The Wheat data set has 2912 observations. More information concerning the data set and the design structure can be found in Campbell et al. (2003). |
| Dataset Splits | Yes | 2. Randomly subdivide the bootstrap data into two groups G_1 and G_2 of size n_l each; |
| Hardware Specification | No | The paper acknowledges "invaluable computational support from the Holland Computing Center" but does not specify any particular hardware models (e.g., CPU, GPU, memory). |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, programming languages, or tools used in the experiments. |
| Experiment Setup | Yes | We arbitrarily set m = 10 and W = 50. Our models were nested, including models that were too small and some that were too large, so that the estimate of $d_{VC}$ would uniquely specify a model. As a typical example, Fig. 1 shows the results for n = 700 and p = 70 with L = 7. [...] The design points for $\hat{d}_{V}$ and $\hat{d}_{VC}$ are 20, 30, 40, 50, 60, 70, 80, 90, and 100 and W = 50. |
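
The split quoted in the Dataset Splits row (step 2 of the algorithm) can be illustrated with a minimal sketch. This is not the authors' code. The function name `bootstrap_split`, the choice to draw exactly 2·n_l indices with replacement, the seed, and the sample size `n_obs = 700` are all assumptions made for illustration. Only the design points 20 through 100 and the two equal groups of size n_l come from the table above.

```python
import random

def bootstrap_split(n_obs, n_l, rng):
    """Hypothetical helper: bootstrap 2*n_l row indices, then split them
    at random into two groups G1 and G2 of size n_l each."""
    # Assumption: the bootstrap sample has exactly 2*n_l rows, drawn with replacement.
    idx = rng.choices(range(n_obs), k=2 * n_l)
    # Shuffle so that membership in G1 versus G2 is random.
    rng.shuffle(idx)
    return idx[:n_l], idx[n_l:]

# Design points quoted in the Experiment Setup row.
design_points = [20, 30, 40, 50, 60, 70, 80, 90, 100]

# Illustrative values only: the sample size and seed are arbitrary assumptions.
n_obs = 700
rng = random.Random(0)

# One (G1, G2) pair of index lists per design point.
splits = {n_l: bootstrap_split(n_obs, n_l, rng) for n_l in design_points}
```

In a full run, a split like this would be repeated for each bootstrap replicate at each design point. The sketch performs a single split per design point and does not compute losses or the VC dimension estimate.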