Cross-validation for Geospatial Data: Estimating Generalization Performance in Geostatistical Problems
Authors: Jing Wang, Laurel Hopkins, Tyler Hallman, W. Douglas Robinson, Rebecca Hutchinson
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analyses compare cross-validation algorithms on both simulated and several real datasets to develop recommendations for a variety of geospatial settings. This paper aims to draw attention to some challenges that arise in model evaluation for geospatial problems and to provide guidance for users. We provide simulated data experiments measuring the bias of several CV estimators, SDM examples demonstrating four geospatial scenarios, and further empirical analyses of our proposed algorithm. |
| Researcher Affiliation | Academia | Jing Wang, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331-5501, USA; Laurel M. Hopkins, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331-5501, USA; Tyler A. Hallman, School of Natural Sciences, Bangor University, Bangor LL57 2DG, UK; W. Douglas Robinson, Department of Fisheries, Wildlife, and Conservation Sciences, Oregon State University, Corvallis, OR 97331-5501, USA; Rebecca A. Hutchinson, School of Electrical Engineering and Computer Science and Department of Fisheries, Wildlife, and Conservation Sciences, Oregon State University, Corvallis, OR 97331-5501, USA |
| Pseudocode | Yes | Algorithm 1 LOOIBCV. Input: training set {T_i}_{i=1}^n = {(x_i, y_i)}_{i=1}^n and density ratios {w_i}_{i=1}^n. Parameter: buffer size r. Output: estimated error Err. 1: for i = 1 to n do 2: Remove data points within the buffer area based on the longitude (long) and latitude (lat) of T_i: [long − r, long + r, lat − r, lat + r]. 3: Fit a model f̂ on the remaining data T_i^{−r}. 4: Calculate the density-ratio-weighted loss on the validation fold T_i: Err_i = w_i · L(y_i, ŷ_i(x_i; T_i^{−r})). 5: end for 6: Return the estimated error: Err = (1/n) ∑_{i=1}^n Err_i. |
| Open Source Code | Yes | Code and data are available at https://github.com/Hutchinson-Lab/Cross-validation-for-Geospatial-Data. |
| Open Datasets | Yes | Project website: https://oregon2020.com/ Link to the dataset: https://alaska.usgs.gov/products/data.php?dataid=197 Link to the dataset: https://www.kaggle.com/datasets/camnugent/california-housing-prices |
| Dataset Splits | Yes | Every dataset contained 1800 training and 500 testing samples, but the scenarios varied in the geographic sampling of these grid locations. We created training and testing datasets for three combinations of species and dataset size, each with four geographic layouts to set up the different scenarios described above. We assembled datasets with either 1000 or 1800 training observations and in each case tested on 500 held-out observations; datasets are named by their species abbreviation and training sample size (i.e., HEWA1000, HEWA1800, and WETA1800). The training set consisted of 710 data points (355 non-detections and 355 detections) collected from KATM and LACL in 2004-2006. The test set contained 134 data points (67 non-detections and 67 detections) collected from ANIA in 2008. We split the data into training and test sets in two ways: (a) by Latitude, where the test set includes 551 points above 39.5 N and the training set includes 10968 points below 39 N; and (b) by Bay, where the test set consists of 981 points within the San Francisco Bay Area and the training set consists of 9775 points whose distance is at least 0.3 degrees from any test sample (Fig. 8). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions running 'machine learning models' but provides no specific details on CPU, GPU, or other computing resources. |
| Software Dependencies | No | We used the default hyperparameters from the scikit-learn Python package (Pedregosa et al., 2011) for all models. We applied the Relative unconstrained Least-Squares Importance Fitting (RuLSIF) method to estimate density ratios (Yamada et al., 2011), with α = 0 and 50 kernels. We fitted Matérn variogram functions with the lag class estimated by Scott's rule (Mälicke, 2021) to calculate ranges of the features of training sets. The paper mentions software such as scikit-learn, RuLSIF, and Matérn variogram functions but does not provide specific version numbers for these, which are required for reproducibility. |
| Experiment Setup | Yes | We set k = 9 folds for all cross-validation methods. We set the block size of BLCV to 2, 4, 8, and 12 grid cells when r = 1, 4, 8, and 12, respectively, such that the block size would mimic the spatial autocorrelation range. Buffered CV (BFCV) and importance-weighted buffered CV (IBCV) use a grid similar to BLCV, but with a block size of 20, which produces 9 blocks on the 60 × 60 landscape, each of which becomes its own fold. We set the buffer size equal to the spatial autocorrelation range of the simulation so that the minimum distance between the training and validation folds was equal to or greater than the range. Finally, IWCV and IBCV required density ratio estimates. We applied the Relative unconstrained Least-Squares Importance Fitting (RuLSIF) method to estimate density ratios (Yamada et al., 2011), with α = 0 and 50 kernels. We explored five classification models for each SDM: Ridge classifier (Ridge), Linear SVM (LSVM), K-Nearest Neighbors (KNN), Random Forest (RF), and Naive Bayes (NB), and we compared their test errors with the CV error estimates. We used the default hyperparameters from the scikit-learn Python package (Pedregosa et al., 2011) for all models. |
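The LOOIBCV pseudocode quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: it assumes 0/1 loss, a scikit-learn-style estimator, and the square buffer from Algorithm 1; the function and variable names (`looibcv`, `coords`, `weights`) are our own.

```python
import numpy as np
from sklearn.base import clone

def looibcv(model, X, y, coords, weights, r):
    """Leave-one-out importance-weighted buffered CV (sketch of Algorithm 1).

    For each point i, drop every training point whose longitude AND latitude
    both lie within r of point i (the square buffer), fit on the remainder,
    and accumulate the density-ratio-weighted 0/1 loss on point i.
    """
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        long_i, lat_i = coords[i]
        in_buffer = (np.abs(coords[:, 0] - long_i) <= r) & \
                    (np.abs(coords[:, 1] - lat_i) <= r)
        train = ~in_buffer                      # buffer also removes point i itself
        f = clone(model).fit(X[train], y[train])
        y_hat = f.predict(X[i:i + 1])[0]
        errs[i] = weights[i] * (y_hat != y[i])  # importance-weighted 0/1 loss
    return errs.mean()
```

Setting all weights to 1 recovers plain buffered leave-one-out CV; the density ratios `weights` would come from an estimator such as RuLSIF, as described in the Experiment Setup row.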
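The Experiment Setup row describes forming blocks on a 60 × 60 landscape, with block size 20 yielding the 9 blocks used as folds for BFCV/IBCV. One plausible way to assign points to such blocks is sketched below; this is our own illustrative guess at the fold-assignment step, not the paper's code, and `block_folds` is a hypothetical helper.

```python
import numpy as np

def block_folds(coords, landscape=60, block_size=20):
    """Map each (x, y) coordinate on a square landscape to a block index.

    Points are binned into square blocks of side `block_size`; points on the
    far edge are clamped into the last block. With landscape=60 and
    block_size=20 this produces 3 x 3 = 9 blocks, one fold per block.
    """
    blocks_per_side = landscape // block_size
    col = np.minimum(coords[:, 0] // block_size, blocks_per_side - 1)
    row = np.minimum(coords[:, 1] // block_size, blocks_per_side - 1)
    return (row * blocks_per_side + col).astype(int)
```

Smaller block sizes (2, 4, 8, 12, matching r = 1, 4, 8, 12) would produce more blocks than folds, in which case blocks would be grouped into the k = 9 folds.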