Cross-validation for Geospatial Data: Estimating Generalization Performance in Geostatistical Problems
Authors: Jing Wang, Laurel Hopkins, Tyler Hallman, W. Douglas Robinson, Rebecca Hutchinson
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analyses compare cross-validation algorithms on both simulated and several real datasets to develop recommendations for a variety of geospatial settings. This paper aims to draw attention to some challenges that arise in model evaluation for geospatial problems and to provide guidance for users. We provide simulated data experiments measuring the bias of several CV estimators, SDM examples demonstrating four geospatial scenarios, and further empirical analyses of our proposed algorithm. |
| Researcher Affiliation | Academia | Jing Wang, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331-5501, USA; Laurel M. Hopkins, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331-5501, USA; Tyler A. Hallman, School of Natural Sciences, Bangor University, Bangor LL57 2DG, UK; W. Douglas Robinson, Department of Fisheries, Wildlife, and Conservation Sciences, Oregon State University, Corvallis, OR 97331-5501, USA; Rebecca A. Hutchinson, School of Electrical Engineering and Computer Science and Department of Fisheries, Wildlife, and Conservation Sciences, Oregon State University, Corvallis, OR 97331-5501, USA |
| Pseudocode | Yes | Algorithm 1 LOOIBCV. Input: training set {T_i}_{i=1}^n = {(x_i, y_i)}_{i=1}^n and density ratios {w_i}_{i=1}^n. Parameter: buffer size r. Output: estimated error Err. 1: for i = 1 to n do 2: Remove data points within the buffer area based on the longitude (long) and latitude (lat) of T_i: [long − r, long + r, lat − r, lat + r]. 3: Fit a model f̂ on the remaining data T_i^{−r}. 4: Calculate the density-ratio-weighted loss on the validation fold T_i: Err_i = w_i · L(y_i, ŷ_i(x_i; T_i^{−r})). 5: end for 6: Return the estimated error: Err = (1/n) ∑_{i=1}^n Err_i. |
| Open Source Code | Yes | Code and data are available at https://github.com/Hutchinson-Lab/Cross-validation-for-Geospatial-Data. |
| Open Datasets | Yes | Project website: https://oregon2020.com/ Link to the dataset: https://alaska.usgs.gov/products/data.php?dataid=197 Link to the dataset: https://www.kaggle.com/datasets/camnugent/california-housing-prices |
| Dataset Splits | Yes | Every dataset contained 1800 training and 500 testing samples, but the scenarios varied in the geographic sampling of these grid locations. We created training and testing datasets for three combinations of species and dataset size, each with four geographic layouts to set up the different scenarios described above. We assembled datasets with either 1000 or 1800 training observations and in each case tested on 500 held-out observations; datasets are named by their species abbreviation and training sample size (i.e., HEWA1000, HEWA1800, and WETA1800). The training set consisted of 710 data points (355 non-detections and 355 detections) collected from KATM and LACL in 2004-2006. The test set contained 134 data points (67 non-detections and 67 detections) collected from ANIA in 2008. We split the data into training and test sets in two ways: (a) by Latitude, where the test set includes 551 points above 39.5 N and the training set includes 10968 points below 39 N; and (b) by Bay, where the test set consists of 981 points within the San Francisco Bay Area and the training set consists of 9775 points whose distance is at least 0.3 degrees from any test sample (Fig. 8). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions running 'machine learning models' but provides no specific details on CPU, GPU, or other computing resources. |
| Software Dependencies | No | We used the default hyperparameters from the scikit-learn Python package (Pedregosa et al., 2011) for all models. We applied the Relative unconstrained Least-Squares Importance Fitting (RuLSIF) method to estimate density ratios (Yamada et al., 2011), with α = 0 and 50 kernels. We fitted Matérn variogram functions with the lag class estimated by Scott's rule (Mälicke, 2021) to calculate ranges of the features of training sets. The paper mentions software such as scikit-learn, RuLSIF, and Matérn variogram functions but does not provide specific version numbers for these, which are required for reproducibility. |
| Experiment Setup | Yes | We set k = 9 folds for all cross-validation methods. We set the block size of BLCV to 2, 4, 8, and 12 grid cells when r = 1, 4, 8, and 12, respectively, such that the block size would mimic the spatial autocorrelation range. Buffered CV (BFCV) and importance-weighted buffered CV (IBCV) use a grid similar to BLCV, but with a block size of 20, which produces 9 blocks on the 60 × 60 landscape, each of which becomes its own fold. We set the buffer size equal to the spatial autocorrelation range of the simulation so that the minimum distance between the training and validation folds was equal to or greater than the range. Finally, IWCV and IBCV required density ratio estimates. We applied the Relative unconstrained Least-Squares Importance Fitting (RuLSIF) method to estimate density ratios (Yamada et al., 2011), with α = 0 and 50 kernels. We explored five classification models for each SDM: Ridge classifier (Ridge), Linear SVM (LSVM), K-Nearest Neighbors (KNN), Random Forest (RF), and Naive Bayes (NB), and we compared their test errors with the CV error estimates. We used the default hyperparameters from the scikit-learn Python package (Pedregosa et al., 2011) for all models. |
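The LOOIBCV pseudocode quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: it assumes 0/1 loss, a scikit-learn-style estimator, and the square buffer from Algorithm 1; the function and variable names (`looibcv`, `coords`, `weights`) are our own.

```python
import numpy as np
from sklearn.base import clone

def looibcv(model, X, y, coords, weights, r):
    """Leave-one-out importance-weighted buffered CV (sketch of Algorithm 1).

    For each point i, drop every training point whose longitude AND latitude
    both lie within r of point i (the square buffer), fit on the remainder,
    and accumulate the density-ratio-weighted 0/1 loss on point i.
    """
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        long_i, lat_i = coords[i]
        in_buffer = (np.abs(coords[:, 0] - long_i) <= r) & \
                    (np.abs(coords[:, 1] - lat_i) <= r)
        train = ~in_buffer                      # buffer also removes point i itself
        f = clone(model).fit(X[train], y[train])
        y_hat = f.predict(X[i:i + 1])[0]
        errs[i] = weights[i] * (y_hat != y[i])  # importance-weighted 0/1 loss
    return errs.mean()
```

Setting all weights to 1 recovers plain buffered leave-one-out CV; the density ratios `weights` would come from an estimator such as RuLSIF, as described in the Experiment Setup row.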
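The Experiment Setup row describes forming blocks on a 60 × 60 landscape, with block size 20 yielding the 9 blocks used as folds for BFCV/IBCV. One plausible way to assign points to such blocks is sketched below; this is our own illustrative guess at the fold-assignment step, not the paper's code, and `block_folds` is a hypothetical helper.

```python
import numpy as np

def block_folds(coords, landscape=60, block_size=20):
    """Map each (x, y) coordinate on a square landscape to a block index.

    Points are binned into square blocks of side `block_size`; points on the
    far edge are clamped into the last block. With landscape=60 and
    block_size=20 this produces 3 x 3 = 9 blocks, one fold per block.
    """
    blocks_per_side = landscape // block_size
    col = np.minimum(coords[:, 0] // block_size, blocks_per_side - 1)
    row = np.minimum(coords[:, 1] // block_size, blocks_per_side - 1)
    return (row * blocks_per_side + col).astype(int)
```

Smaller block sizes (2, 4, 8, 12, matching r = 1, 4, 8, 12) would produce more blocks than folds, in which case blocks would be grouped into the k = 9 folds.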