On Minimizing the Training Set Fill Distance in Machine Learning Regression
Authors: Paolo Climaco, Jochen Garcke
DMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For empirical validation, we perform experiments using two regression models on three datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. |
| Researcher Affiliation | Academia | Paolo Climaco (EMAIL), Jochen Garcke (EMAIL), Institut für Numerische Simulation, Universität Bonn, Germany; Fraunhofer SCAI, Sankt Augustin, Germany |
| Pseudocode | Yes | Algorithm 1 Farthest Point Sampling (FPS) |
| Open Source Code | Yes | Note that our GitHub repository contains all the code necessary to reproduce the results we present. The repository includes code for downloading, reading, and preprocessing the datasets, our implementation of the FPS, regression models, and evaluation procedures. Furthermore, we have included a Jupyter notebook that reproduces the experiments on QM7, with a runtime of only a few minutes. Repository: https://github.com/Fraunhofer-SCAI/Fill_Distance_Regression |
| Open Datasets | Yes | QM7 (Blum and Reymond, 2009; Rupp et al., 2012) is a benchmark dataset in quantum chemistry, consisting of 7165 small organic molecules... QM8 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2015) is a curated collection of 21,786 organic molecules... QM9 (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014) is a publicly available quantum chemistry dataset... The revised MD17 (Christensen and von Lilienfeld, 2020b) (r MD17) is an updated version of the molecular dynamics dataset (MD17) (Unke et al., 2021)... |
| Dataset Splits | Yes | For each sampling strategy, we construct multiple training sets consisting of different amounts of samples. For each sampling strategy and training set size, the training set selection process is independently run five times. In the case of RDM, points are independently and uniformly selected in each run, while for the other sampling techniques, the initial point used to initialize the selection is chosen at random in each run. Therefore, for each selection strategy and training set size, each analyzed model is independently trained and tested five times. The reported test results are the average of the five runs. We also plot error bands, which, unless otherwise specified, represent the standard deviation of the results. ...We analyze a different range for the size of the training sets, from 0.1% to 1% of the available points, instead of the range 1% to 10%. |
| Hardware Specification | Yes | We used a 48-core CPU with 384 GB RAM. |
| Software Dependencies | No | The paper mentions software like 'Mordred' and 'RDKit package' but does not specify their version numbers. It also refers to 'KRR' and 'FNNs' as model types, not specific software with versions. |
| Experiment Setup | Yes | The hyperparameters γ and λ are optimized through the following process: first, we perform a cross-validation grid search to find the best hyperparameters for each training set size using subsets obtained by random sampling. Next, the average of the best parameter pair for each training set size is used to build the final model. The KRR hyperparameters are varied on a grid of 12 points between 10^-14 and 10^-2. ...The values of λ we employed are 1.9 * 10^-4, 2.2 * 10^-3 and 1.5 * 10^-11 for QM7, QM8 and QM9, respectively. ...Following (Pinheiro et al., 2020), we set l = 3 and consider only ReLU activation functions... |
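The Farthest Point Sampling pseudocode referenced above (Algorithm 1) can be sketched as a greedy loop: repeatedly pick the point farthest from the current selection, which is exactly the step that shrinks the training set's fill distance. This is a minimal NumPy rendition of the standard algorithm, not the authors' implementation from the repository; the function name and the random initial point are our choices (the paper does state that the initial point is randomly selected in each run).

```python
import numpy as np

def farthest_point_sampling(X, k, seed=0):
    """Greedily select k rows of X by repeatedly adding the point
    farthest from the already-selected set (standard FPS sketch).

    Returns the selected indices and the fill distance of the
    selection, i.e. the maximum distance from any point in X to its
    nearest selected point."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]  # random initialization, as in the paper
    # dist[i] = distance from X[i] to its nearest selected point so far
    dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))     # farthest point from current selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    fill_distance = float(dist.max())
    return np.array(selected), fill_distance
```

With a fixed seed the selections are nested, so the fill distance is non-increasing as k grows, which is the quantity the paper's bound on the maximum prediction error is driven by.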
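The cross-validation grid search quoted in the Experiment Setup row could be reproduced along the following lines with scikit-learn's KernelRidge. The 12 logarithmically spaced values between 10^-14 and 10^-2 come from the quoted text; the synthetic data, the RBF kernel choice, and the assumption that γ and λ share the same grid are ours.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the molecular descriptor data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=100)

# 12 log-spaced grid points between 1e-14 and 1e-2, per the quoted setup.
grid = np.logspace(-14, -2, 12)

# KRR with λ = alpha (regularization) and γ = gamma (RBF width).
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": grid, "gamma": grid},
    cv=5,
)
search.fit(X, y)
best = search.best_params_  # best (λ, γ) pair for this training set size
```

The paper then averages the best parameter pair over training set sizes before fitting the final model; that outer loop is omitted here.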