On Minimizing the Training Set Fill Distance in Machine Learning Regression
Authors: Paolo Climaco, Jochen Garcke
DMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For empirical validation, we perform experiments using two regression models on three datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. |
| Researcher Affiliation | Academia | Paolo Climaco (EMAIL), Jochen Garcke (EMAIL), Institut für Numerische Simulation, Universität Bonn, Germany; Fraunhofer SCAI, Sankt Augustin, Germany |
| Pseudocode | Yes | Algorithm 1 Farthest Point Sampling (FPS) |
| Open Source Code | Yes | Note that our GitHub repository contains all the code necessary to reproduce the results we present. The repository includes code for downloading, reading, and preprocessing the datasets, our implementation of the FPS, regression models, and evaluation procedures. Furthermore, we have included a Jupyter notebook that reproduces the experiments on QM7, with a runtime of only a few minutes. Repository: https://github.com/Fraunhofer-SCAI/Fill_Distance_Regression |
| Open Datasets | Yes | QM7 (Blum and Reymond, 2009; Rupp et al., 2012) is a benchmark dataset in quantum chemistry, consisting of 7165 small organic molecules... QM8 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2015) is a curated collection of 21,786 organic molecules... QM9 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014) is a publicly available quantum chemistry dataset... The revised MD17 (Christensen and von Lilienfeld, 2020b) (r MD17) is an updated version of the molecular dynamics dataset (MD17) (Unke et al., 2021)... |
| Dataset Splits | Yes | For each sampling strategy, we construct multiple training sets consisting of different amounts of samples. For each sampling strategy and training set size, the training set selection process is independently run five times. In the case of RDM, points are independently and uniformly selected in each run, while for the other sampling techniques, the initial point used to initialize the selection is chosen at random in each run. Therefore, for each selection strategy and training set size, each analyzed model is independently trained and tested five times. The reported test results are the average of the five runs. We also plot error bands, which, unless otherwise specified, represent the standard deviation of the results. ...We analyze a different range for the size of the training sets, from 0.1% to 1% of the available points, instead of the range 1% to 10%. |
| Hardware Specification | Yes | We used a 48-core CPU with 384 GB RAM. |
| Software Dependencies | No | The paper mentions software like 'Mordred' and 'RDKit package' but does not specify their version numbers. It also refers to 'KRR' and 'FNNs' as model types, not specific software with versions. |
| Experiment Setup | Yes | The hyperparameters γ and λ are optimized through the following process: first, we perform a cross-validation grid search to find the best hyperparameters for each training set size using subsets obtained by random sampling. Next, the average of the best parameter pair for each training set size is used to build the final model. The KRR hyperparameters are varied on a grid of 12 points between 10^-14 and 10^-2. ...The values of λ we employed are 1.9 * 10^-4, 2.2 * 10^-3 and 1.5 * 10^-11 for QM7, QM8 and QM9, respectively. ...Following (Pinheiro et al., 2020), we set l = 3 and consider only ReLU activation functions... |
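The Farthest Point Sampling pseudocode referenced above (Algorithm 1) can be sketched as a greedy loop: repeatedly pick the point farthest from the current selection, which is exactly the step that shrinks the training set's fill distance. This is a minimal NumPy rendition of the standard algorithm, not the authors' implementation from the repository; the function name and the random initial point are our choices (the paper does state that the initial point is randomly selected in each run).

```python
import numpy as np

def farthest_point_sampling(X, k, seed=0):
    """Greedily select k rows of X by repeatedly adding the point
    farthest from the already-selected set (standard FPS sketch).

    Returns the selected indices and the fill distance of the
    selection, i.e. the maximum distance from any point in X to its
    nearest selected point."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]  # random initialization, as in the paper
    # dist[i] = distance from X[i] to its nearest selected point so far
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))     # farthest point from current selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    fill_distance = float(dist.max())
    return np.array(selected), fill_distance
```

With a fixed seed the selections are nested, so the fill distance is non-increasing as k grows, which is the quantity the paper's bound on the maximum prediction error is driven by.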
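The cross-validation grid search quoted in the Experiment Setup row could be reproduced along the following lines with scikit-learn's KernelRidge. The 12 logarithmically spaced values between 10^-14 and 10^-2 come from the quoted text; the synthetic data, the RBF kernel choice, and the assumption that γ and λ share the same grid are ours.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the molecular descriptor data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=100)

# 12 log-spaced grid points between 1e-14 and 1e-2, per the quoted setup.
grid = np.logspace(-14, -2, 12)

# KRR with λ = alpha (regularization) and γ = gamma (RBF width).
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": grid, "gamma": grid},
    cv=5,
)
search.fit(X, y)
best = search.best_params_  # best (λ, γ) pair for this training set size
```

The paper then averages the best parameter pair over training set sizes before fitting the final model; that outer loop is omitted here.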