Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts
Authors: Anastasis Kratsios, Haitz Sáez de Ocáriz Borde, Takashi Furuya, Marc T. Law
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two standard machine learning tasks: regression and classification. We experimentally show that MoMLPs, which distribute predictions over multiple neural networks, are competitive with a single large neural network containing as many model parameters as all the MoMLPs combined. This is desirable in cases where the large neural network does not fit into the memory of a single machine. By contrast, the MoMLP model can be trained by distributing the experts across separate machines (or, equivalently, serially on a single machine). Inference can then be performed by loading only a single MoMLP at a time onto the GPU. |
| Researcher Affiliation | Collaboration | Anastasis Kratsios (EMAIL), Vector Institute and McMaster University, Canada; Haitz Sáez de Ocáriz Borde (EMAIL), Oxford University, United Kingdom; Takashi Furuya (EMAIL), Shimane University, Japan; Marc T. Law (EMAIL), NVIDIA, Canada |
| Pseudocode | Yes | Algorithm 1: Routing Tree from Example 1. Algorithm 2: MoMLPs Training. |
| Open Source Code | Yes | We include experimental details here; for further details, we refer to the source code in the supplementary material. |
| Open Datasets | Yes | Datasets. We evaluate classification on standard image datasets such as CIFAR-10 (Krizhevsky & Hinton, 2010), CIFAR-100, and Food-101 (Bossard et al., 2014), which consist of 10, 100, and 101 different classes, respectively. |
| Dataset Splits | Yes | Our training and test samples are the s^n vertices of the regular grid defined on [a, b]^n. At each run, 80% of the samples are randomly selected for training and validation, and the remaining 20% for testing. |
| Hardware Specification | No | The paper discusses concepts like 'GPU VRAM' and 'loading onto the GPU' in relation to the model's operation and memory requirements, but it does not specify any particular GPU models, CPU models, or other hardware components used to conduct the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' and the optimizers 'Adam' and 'AdamW', but it does not provide specific version numbers for any of these software components, which are required for reproducibility. |
| Experiment Setup | Yes | We set the width of our MoMLPs to w = 1000; in other words, each hidden layer of our MoMLPs contains a linear matrix of size w × w. In the regression task, our MoMLPs contain 3 hidden layers and we use a learnable PReLU as the activation function. For training, we use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 10^-4 and the default PyTorch hyperparameters. In the classification task, we follow the setup of Oquab et al. (2023) and use AdamW (Loshchilov & Hutter, 2019) as the optimizer with a learning rate of 10^-3 and the default parameters from PyTorch. Our MoMLPs consist of four hidden layers for the classification task, and we apply BatchNorm1d before the PReLU activation function. |
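The regression setup described above (width w = 1000, 3 hidden layers, learnable PReLU, Adam with learning rate 10^-4) can be sketched as a single expert in PyTorch. This is a minimal illustration, not the authors' implementation: the class name `ExpertMLP` and the input/output dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ExpertMLP(nn.Module):
    """Sketch of one MoMLP expert: 3 hidden layers of width w with
    learnable PReLU activations, as described in the regression setup.
    Input/output dimensions here are illustrative assumptions."""

    def __init__(self, in_dim=2, out_dim=1, width=1000, depth=3):
        super().__init__()
        layers = [nn.Linear(in_dim, width), nn.PReLU()]  # first hidden layer
        for _ in range(depth - 1):
            # each remaining hidden layer holds a w x w linear matrix
            layers += [nn.Linear(width, width), nn.PReLU()]
        layers.append(nn.Linear(width, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

expert = ExpertMLP()
# Adam with lr 1e-4 and default PyTorch hyperparameters, per the setup
optimizer = torch.optim.Adam(expert.parameters(), lr=1e-4)

y = expert(torch.randn(8, 2))
print(y.shape)  # torch.Size([8, 1])
```

In the full MoMLP model, several such experts would be trained separately (on separate machines or serially), with the routing tree of Algorithm 1 deciding which expert handles a given input at inference time.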