Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts
Authors: Anastasis Kratsios, Haitz Sáez de Ocáriz Borde, Takashi Furuya, Marc T. Law
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two standard machine learning tasks: regression and classification. We experimentally show that MoMLPs, which distribute predictions over multiple neural networks, are competitive with a single large neural network containing as many model parameters as all the MoMLPs combined. This is desirable in cases where the large neural network does not fit into the memory of a single machine. By contrast, the MoMLP model can be trained by distributing the experts across separate machines (or, equivalently, serially on a single machine). Inference can then be performed by loading only a single MoMLP at a time onto the GPU. |
| Researcher Affiliation | Collaboration | Anastasis Kratsios (EMAIL), Vector Institute and McMaster University, Canada; Haitz Sáez de Ocáriz Borde (EMAIL), Oxford University, United Kingdom; Takashi Furuya (EMAIL), Shimane University, Japan; Marc T. Law (EMAIL), NVIDIA, Canada |
| Pseudocode | Yes | Algorithm 1: Routing Tree from Example 1. Algorithm 2: MoMLPs Training. |
| Open Source Code | Yes | We include experimental details here; for further details, we refer to the source code in the supplementary material. |
| Open Datasets | Yes | Datasets. We evaluate classification on standard image datasets such as CIFAR-10 (Krizhevsky & Hinton, 2010), CIFAR-100, and Food-101 (Bossard et al., 2014), which consist of 10, 100, and 101 different classes, respectively. |
| Dataset Splits | Yes | Our training and test samples are the s^n vertices of the regular grid defined on [a, b]^n. At each run, 80% of the samples are randomly selected for training and validation, and the remaining 20% for testing. |
| Hardware Specification | No | The paper discusses concepts like 'GPU VRAM' and 'loading onto the GPU' in relation to the model's operation and memory requirements, but it does not specify any particular GPU models, CPU models, or other hardware components used to conduct the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' and the optimizers 'Adam' and 'AdamW', but it does not provide specific version numbers for any of these software components, which are required for reproducibility. |
| Experiment Setup | Yes | We set the width of our MoMLPs to w = 1000; in other words, each hidden layer of our MoMLPs contains a linear matrix of size w × w. In the regression task, our MoMLPs contain 3 hidden layers and we use a learnable PReLU as the activation function. For training, we use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 10^-4 and the default PyTorch hyperparameters. In the classification task, we follow the setup of Oquab et al. (2023) and use AdamW (Loshchilov & Hutter, 2019) as the optimizer with a learning rate of 10^-3 and the default parameters from PyTorch. Our MoMLPs consist of four hidden layers for the classification task, and we apply BatchNorm1d before the PReLU activation function. |
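The regression setup described above (width w = 1000, 3 hidden layers, learnable PReLU, Adam with learning rate 10^-4) can be sketched as a single expert in PyTorch. This is a minimal illustration, not the authors' implementation: the class name `ExpertMLP` and the input/output dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ExpertMLP(nn.Module):
    """Sketch of one MoMLP expert: 3 hidden layers of width w with
    learnable PReLU activations, as described in the regression setup.
    Input/output dimensions here are illustrative assumptions."""

    def __init__(self, in_dim=2, out_dim=1, width=1000, depth=3):
        super().__init__()
        layers = [nn.Linear(in_dim, width), nn.PReLU()]  # first hidden layer
        for _ in range(depth - 1):
            # each remaining hidden layer holds a w x w linear matrix
            layers += [nn.Linear(width, width), nn.PReLU()]
        layers.append(nn.Linear(width, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

expert = ExpertMLP()
# Adam with lr 1e-4 and default PyTorch hyperparameters, per the setup
optimizer = torch.optim.Adam(expert.parameters(), lr=1e-4)

y = expert(torch.randn(8, 2))
print(y.shape)  # torch.Size([8, 1])
```

In the full MoMLP model, several such experts would be trained separately (on separate machines or serially), with the routing tree of Algorithm 1 deciding which expert handles a given input at inference time.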