Scaling Laws of Distributed Random Forests
Authors: Katharina Flügel, Charlotte Debus, Markus Götz, Achim Streit, Marie Weiel
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this gap, we present a comprehensive analysis of the scaling capabilities of distributed random forests on up to 64 compute nodes. Using a tree-parallel approach, we demonstrate a strong scaling speedup of up to 31.98 and a weak scaling efficiency of over 0.96 without affecting predictive performance of the global model. |
| Researcher Affiliation | Academia | Katharina Flügel, Karlsruhe Institute of Technology (KIT), Scientific Computing Center (SCC), Helmholtz AI |
| Pseudocode | Yes | A.1 Distributed Random Forests Pseudocode Algorithm 1 summarizes the tree-parallel training of distributed random forests. Algorithms 2 and 3 describe the two variants for inference, either aggregating a global model or using global voting for distributed inference. |
| Open Source Code | Yes | Our code is open-source and publicly available at github.com/Helmholtz-AIEnergy/special-couscous. |
| Open Datasets | Yes | We use synthetic data generated with scikit-learn's `make_classification` as this allows us to scale both the number of samples n and features m freely and adjust the class balance. ... Additionally, we extend the strong and weak scaling experiments to the HIGGS dataset (Baldi et al., 2014; Whiteson, 2014), a binary classification task to distinguish between signal and background events in particle collision data. |
| Dataset Splits | Yes | For all datasets, 75 % of the samples (n = 0.75 n_total) are used as training set, while the remaining 25 % are used as test set. During bootstrapping, each tree draws a random set of n samples with replacement. |
| Hardware Specification | Yes | All experiments were conducted on up to 64 compute nodes, each of which has two Intel Xeon Platinum 8368 processors for a total of 76 cores, 64 kB L1 and 1 MB L2 cache per core, and 57 MB L3 cache per processor. Most experiments used standard compute nodes with 256 GB main memory. The exception is the serial baseline for the strong scaling experiments, which used high-memory nodes with 512 GB main memory but otherwise identical hardware to fit the model and data. All nodes are connected with InfiniBand 4X HDR 200 Gbit/s interconnect. |
| Software Dependencies | Yes | All experiments used Open MPI v4.1.6, Python v3.11.2, mpi4py v4.0.1, numpy v2.2.2, scikit-learn v1.6.1, and scipy v1.15.1. |
| Experiment Setup | Yes | We run two series of experiments: training t = 1600 trees on the 1M dataset and t = 448 trees on the 10M dataset on p ∈ {1, 2, 4, 8, 16, 32, 64} compute nodes. Each node trains a local forest of t/p trees. The number of trees was chosen as the maximum multiple of 64 we could train within 100 min on a single node. |
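The speedup and efficiency figures quoted in the table follow the standard HPC definitions (the paper's exact measurement protocol may add details not shown here): strong scaling keeps the total problem size fixed while increasing p, and weak scaling grows the problem size proportionally with p.

```latex
S_\text{strong}(p) = \frac{T(1)}{T(p)}, \qquad
E_\text{weak}(p) = \frac{T(1)}{T(p)}
```

Here T(p) is the runtime on p nodes; for strong scaling an ideal speedup is S(p) = p, while for weak scaling an ideal efficiency is E(p) = 1, which puts the reported 31.98 speedup on 64 nodes and the weak scaling efficiency above 0.96 in context.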
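The data-generation and split scheme described above (synthetic data from scikit-learn's `make_classification`, 75 %/25 % train/test split) can be sketched as follows. The sizes here are small placeholders for illustration; the paper's datasets (1M and 10M samples) are far larger, and the exact `make_classification` parameters used in the paper are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder sizes for illustration; the paper scales n and m much further.
n_total, m = 10_000, 20

# Synthetic classification data; n_informative is an assumed parameter choice.
X, y = make_classification(
    n_samples=n_total,
    n_features=m,
    n_informative=10,
    n_classes=2,
    random_state=0,
)

# 75 % of the samples form the training set, the remaining 25 % the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)
```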
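The tree-parallel scheme summarized above (each of p nodes trains a local forest of t/p trees; predictions are combined by global voting) can be illustrated without MPI by looping over the "nodes" serially. This is a minimal sketch of the idea, not the authors' implementation: in the actual distributed setting each local forest lives on a separate MPI rank and the vote aggregation would be an allreduce-style collective.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# p "nodes", t trees in total; each node trains a local forest of t/p trees.
p, t = 4, 16

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

# Distinct random_state per "rank" so each local forest draws different
# bootstrap samples and feature subsets.
local_forests = [
    RandomForestClassifier(n_estimators=t // p, random_state=rank).fit(
        X_train, y_train
    )
    for rank in range(p)
]

# Global voting: sum the per-forest class probabilities across all local
# forests and predict the argmax class, mimicking a reduction over ranks.
votes = sum(f.predict_proba(X_test) for f in local_forests)
y_pred = votes.argmax(axis=1)
accuracy = (y_pred == y_test).mean()
```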