Scaling Laws of Distributed Random Forests
Authors: Katharina Flügel, Charlotte Debus, Markus Götz, Achim Streit, Marie Weiel
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this gap, we present a comprehensive analysis of the scaling capabilities of distributed random forests on up to 64 compute nodes. Using a tree-parallel approach, we demonstrate a strong scaling speedup of up to 31.98 and a weak scaling efficiency of over 0.96 without affecting predictive performance of the global model. |
| Researcher Affiliation | Academia | Katharina Flügel, Karlsruhe Institute of Technology (KIT), Scientific Computing Center (SCC), Helmholtz AI |
| Pseudocode | Yes | A.1 Distributed Random Forests Pseudocode Algorithm 1 summarizes the tree-parallel training of distributed random forests. Algorithms 2 and 3 describe the two variants for inference, either aggregating a global model or using global voting for distributed inference. |
| Open Source Code | Yes | Our code is open-source and publicly available at github.com/Helmholtz-AIEnergy/special-couscous. |
| Open Datasets | Yes | We use synthetic data generated with scikit-learn's `make_classification` as this allows us to scale both the number of samples n and features m freely and adjust the class balance. ... Additionally, we extend the strong and weak scaling experiments to the HIGGS dataset (Baldi et al., 2014; Whiteson, 2014), a binary classification task to distinguish between signal and background events in particle collision data. |
| Dataset Splits | Yes | For all datasets, 75 % of the samples (n = 0.75 n_total) are used as training set, while the remaining 25 % are used as test set. During bootstrapping, each tree draws a random set of n samples with replacement. |
| Hardware Specification | Yes | All experiments were conducted on up to 64 compute nodes, each of which has two Intel Xeon Platinum 8368 processors for a total of 76 cores, 64 kB L1 and 1 MB L2 cache per core, and 57 MB L3 cache per processor. Most experiments used standard compute nodes with 256 GB main memory. The exception is the serial baseline for the strong scaling experiments, which used high-memory nodes with 512 GB main memory but otherwise identical hardware to fit the model and data. All nodes are connected with InfiniBand 4X HDR 200 Gbit/s interconnect. |
| Software Dependencies | Yes | All experiments used Open MPI v4.1.6, Python v3.11.2, mpi4py v4.0.1, numpy v2.2.2, scikit-learn v1.6.1, and scipy v1.15.1. |
| Experiment Setup | Yes | We run two series of experiments: training t = 1600 trees on the 1M dataset and t = 448 trees on the 10M dataset on p ∈ {1, 2, 4, 8, 16, 32, 64} compute nodes. Each node trains a local forest of t/p trees. The number of trees was chosen as the maximum multiple of 64 we could train within 100 min on a single node. |
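The speedup and efficiency figures quoted in the table follow the standard HPC definitions (the paper's exact measurement protocol may add details not shown here): strong scaling keeps the total problem size fixed while increasing p, and weak scaling grows the problem size proportionally with p.

```latex
S_\text{strong}(p) = \frac{T(1)}{T(p)}, \qquad
E_\text{weak}(p) = \frac{T(1)}{T(p)}
```

Here T(p) is the runtime on p nodes; for strong scaling an ideal speedup is S(p) = p, while for weak scaling an ideal efficiency is E(p) = 1, which puts the reported 31.98 speedup on 64 nodes and the weak scaling efficiency above 0.96 in context.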
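The data-generation and split scheme described above (synthetic data from scikit-learn's `make_classification`, 75 %/25 % train/test split) can be sketched as follows. The sizes here are small placeholders for illustration; the paper's datasets (1M and 10M samples) are far larger, and the exact `make_classification` parameters used in the paper are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder sizes for illustration; the paper scales n and m much further.
n_total, m = 10_000, 20

# Synthetic classification data; n_informative is an assumed parameter choice.
X, y = make_classification(
    n_samples=n_total,
    n_features=m,
    n_informative=10,
    n_classes=2,
    random_state=0,
)

# 75 % of the samples form the training set, the remaining 25 % the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)
```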
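The tree-parallel scheme summarized above (each of p nodes trains a local forest of t/p trees; predictions are combined by global voting) can be illustrated without MPI by looping over the "nodes" serially. This is a minimal sketch of the idea, not the authors' implementation: in the actual distributed setting each local forest lives on a separate MPI rank and the vote aggregation would be an allreduce-style collective.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# p "nodes", t trees in total; each node trains a local forest of t/p trees.
p, t = 4, 16

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

# Distinct random_state per "rank" so each local forest draws different
# bootstrap samples and feature subsets.
local_forests = [
    RandomForestClassifier(n_estimators=t // p, random_state=rank).fit(
        X_train, y_train
    )
    for rank in range(p)
]

# Global voting: sum the per-forest class probabilities across all local
# forests and predict the argmax class, mimicking a reduction over ranks.
votes = sum(f.predict_proba(X_test) for f in local_forests)
y_pred = votes.argmax(axis=1)
accuracy = (y_pred == y_test).mean()
```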