Random Feature Representation Boosting
Authors: Nikita Zozoulenko, Thomas Cass, Lukas Gonon
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive numerical experiments on tabular datasets for both regression and classification, we show that RFRBoost significantly outperforms RFNNs and end-to-end trained MLP ResNets in the small- to medium-scale regime where RFNNs are typically applied. |
| Researcher Affiliation | Academia | 1Department of Mathematics, Imperial College London, UK 2School of Computer Science, University of St. Gallen, Switzerland. Correspondence to: Nikita Zozoulenko <EMAIL>. |
| Pseudocode | Yes | The complete procedure is detailed in Algorithm 1. Algorithm 1 (Greedy RFRBoost, MSE Loss). Input: Data (xᵢ, yᵢ)ᵢ₌₁ⁿ, T layers, learning rate η, ℓ2 regularization λ, initial representation Φ₀. W₀ ← argmin_W (1/n) Σᵢ₌₁ⁿ ‖yᵢ − W Φ₀(xᵢ)‖² ... The full gradient-greedy procedure is detailed in Algorithm 2. |
| Open Source Code | Yes | All our code is publicly available at https://github.com/nikitazozoulenko/random-feature-representation-boosting. |
| Open Datasets | Yes | We experiment on all datasets of the curated OpenML tabular regression (Fischer et al., 2023) and classification (Bischl et al., 2021) benchmark suites with 200 or fewer features. To complement our experiments on the OpenML benchmark suite, we conducted additional full-scale evaluations on four larger datasets... (one classification: Cover Type (Blackard, 1998); one regression: Year Prediction MSD (YPMSD) (Bertin-Mahieux, 2011)). |
| Dataset Splits | Yes | Evaluation Procedure: We use a nested 5-fold cross-validation (CV) procedure to tune and evaluate all models, run independently for each dataset. ... For these larger-scale evaluations, we employed a hold-out test set strategy. Hyperparameters for each model were selected via a grid search performed on a 20% validation split of the training data. For YPMSD (regression), we adhered to the designated split: the first 463,715 examples for training and the subsequent 51,630 examples for testing. For Cover Type (classification), we used the standard train-test split provided by the popular Python machine learning library Scikit-learn (Pedregosa et al., 2011), yielding 464,809 training and 116,203 test instances. |
| Hardware Specification | Yes | All experiments are run on a single CPU core on an institutional HPC cluster, mostly comprised of AMD EPYC 7742 nodes. ... All experiments were carried out on a single NVIDIA RTX 6000 (Turing architecture) GPU. |
| Software Dependencies | No | The paper mentions software like Optuna (Akiba et al., 2019), PyTorch (Paszke et al., 2019), and Scikit-learn (Pedregosa et al., 2011). However, the years provided in the citations refer to the publication year of the papers introducing these software packages, not the specific version numbers used for the experiments. Specific version numbers for the software dependencies are not provided. |
| Experiment Setup | Yes | Hyperparameters: All hyperparameters are tuned with Optuna in the innermost fold, using 100 trials per outer fold, model, and dataset. For ridge and logistic regression, we tune the ℓ2 regularization. For the neural network-based models, we fix the feature dimension of each residual block to 512 and use 1 to 10 layers. For E2E networks, we tune the hidden size, learning rate, learning rate decay, number of epochs, batch size, and weight decay. For RFRBoost, we tune the ℓ2 regularization of the linear predictor and functional gradient mapping, the boosting learning rate, and the variance of the random features. For RFNNs, we tune the random feature dimension, random feature variance, and ℓ2 regularization. For XGBoost, we tune the ℓ1 and ℓ2 regularization, tree depth, boosting learning rate, and the number of weak learners. For a detailed list of hyperparameter ranges, along with an ablation study comparing SWIM random features to i.i.d. Gaussian random features, we refer the reader to Appendix E. |
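To make the quoted Algorithm 1 structure concrete, the following is a minimal NumPy sketch of a greedy layer-wise boosting loop with MSE loss: fit a ridge readout on the current representation, then add a random-feature block fitted toward the residual direction. This is illustrative only — the function names, the tanh i.i.d. Gaussian features (the paper also considers SWIM features), and the pseudo-inverse pullback of the residual are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def ridge_fit(Phi, Y, lam):
    # Closed-form ridge regression readout: minimizes
    # (1/n) * sum ||y_i - W phi_i||^2 + lam * ||W||^2
    n, d = Phi.shape
    A = Phi.T @ Phi + n * lam * np.eye(d)
    return np.linalg.solve(A, Phi.T @ Y).T  # shape (k, d)

def greedy_rfrboost_mse(X, Y, T=3, eta=0.5, lam=1e-3, width=64, seed=0):
    """Sketch of a greedy RFRBoost-style loop (Algorithm 1 structure).

    Per layer: fit a linear readout W on the current representation Phi,
    map the MSE residuals back into representation space, and fit a
    random-feature block to that target. Illustrative, not the paper's
    exact algorithm.
    """
    rng = np.random.default_rng(seed)
    Phi = X.copy()                       # initial representation Phi_0
    for _ in range(T):
        W = ridge_fit(Phi, Y, lam)       # readout on current features
        resid = Y - Phi @ W.T            # MSE residuals, shape (n, k)
        # Fixed random features of the current representation
        A = rng.normal(scale=1.0 / np.sqrt(Phi.shape[1]),
                       size=(Phi.shape[1], width))
        F = np.tanh(Phi @ A)
        # Target update in representation space via pseudo-inverse of W
        G = resid @ np.linalg.pinv(W).T  # shape (n, d)
        Delta = ridge_fit(F, G, lam)     # fit block to the target direction
        Phi = Phi + eta * (F @ Delta.T)  # boosted representation update
    W = ridge_fit(Phi, Y, lam)           # final readout
    return Phi, W
```

On synthetic near-linear data the boosted representation's training MSE should not be worse than a plain ridge fit on the raw inputs, since each block is fitted to reduce the current residual.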
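The nested 5-fold CV protocol quoted under "Dataset Splits" can be sketched as follows. The paper tunes with Optuna (100 trials per outer fold); this sketch substitutes a plain grid over candidate hyperparameters for self-containment, and all function names here are illustrative.

```python
import numpy as np

def kfold_indices(n, k, rng):
    # Shuffle 0..n-1 and split into k roughly equal folds
    return np.array_split(rng.permutation(n), k)

def nested_cv(X, y, fit, score, param_grid, outer_k=5, inner_k=5, seed=0):
    """Sketch of nested CV: for each outer fold, select hyperparameters
    on the inner folds of the training portion, refit with the winner,
    and score once on the held-out outer fold."""
    rng = np.random.default_rng(seed)
    outer = kfold_indices(len(X), outer_k, rng)
    outer_scores = []
    for i in range(outer_k):
        test_idx = outer[i]
        train_idx = np.concatenate([outer[j] for j in range(outer_k) if j != i])
        inner = kfold_indices(len(train_idx), inner_k, rng)
        best_param, best_val = None, -np.inf
        for p in param_grid:             # hyperparameter search (grid here)
            vals = []
            for m in range(inner_k):
                val_idx = train_idx[inner[m]]
                fit_idx = np.concatenate(
                    [train_idx[inner[j]] for j in range(inner_k) if j != m])
                model = fit(X[fit_idx], y[fit_idx], p)
                vals.append(score(model, X[val_idx], y[val_idx]))
            if np.mean(vals) > best_val:
                best_param, best_val = p, np.mean(vals)
        # Refit on the full outer-training portion with the selected
        # hyperparameters, then score on the untouched outer test fold
        model = fit(X[train_idx], y[train_idx], best_param)
        outer_scores.append(score(model, X[test_idx], y[test_idx]))
    return outer_scores
```

This yields one unbiased score per outer fold; the paper reports such scores per dataset and model.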