Breaking Neural Network Scaling Laws with Modularity
Authors: Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our theoretical model on a novel parametrically controllable sine wave regression task and show that sample complexity varies exponentially with task dimension. We empirically validate the improved generalizability (both in- and out-of-distribution) of our modular learning approach on parametrically controllable, high-dimensional tasks: sine-wave regression and Compositional CIFAR-10. |
| Researcher Affiliation | Academia | Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete Massachusetts Institute of Technology EMAIL |
| Pseudocode | Yes | Alg 1 shows the full procedure to find a single module projection Ûᵢ; each step of the algorithm simply applies gradient descent on Eqn 17 with respect to Ûᵢ. We repeat this procedure K times with different random initializations to find the initial values of all K module projections in our architecture. Then, we train all module parameters (including the Ûᵢ) via gradient descent on the task loss. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It does not contain a repository link, an explicit code release statement, or code in supplementary materials. |
| Open Datasets | Yes | We conduct experiments on a Compositional CIFAR-10 dataset inspired by the Compositional MNIST dataset of (Jarvis et al., 2023). In the task, combinations of k CIFAR-10 images are concatenated together and the model is asked to predict the class of all component images simultaneously. |
| Dataset Splits | Yes | A training dataset is generated by first drawing n training samples x from a mean-zero Gaussian: x ~ N(0, I). Then, for each x, the regression target y(x) is computed. The test dataset is constructed analogously. ... We use a fixed training set size of 10^6; thus, the probability of a test set point having the same class permutation as a training set point is at most 10^6 / 10^k. For large k, we expect each test set point to test a class permutation unobserved in the training set. We use a test set of size 10000. |
| Hardware Specification | No | Experiments are run on a computing cluster with GPUs ranging in memory size from 11 GB to 80 GB. |
| Software Dependencies | No | Networks are trained using Adam (Kingma & Ba, 2015) to minimize a mean squared error loss. ... In our experiments, all neural networks are fully connected and use ReLU activations except at the final layer. |
| Experiment Setup | Yes | Architecture and hyperparameter settings: sine wave regression. In our experiments, all neural networks are fully connected and use ReLU activations except at the final layer. We do not use additional operations in the network such as batch normalization. Networks are trained using Adam (Kingma & Ba, 2015) to minimize a mean squared error loss. We perform a sweep over learning rates in {0.001, 0.01, 0.1} and find that the learning rate of 0.01 performs best in general over all experiments... All networks are trained for 10000 iterations... The network architectures are varied as follows: the width of the hidden layers is selected from {8, 32, 128}, and the number of layers is selected from {3, 5, 7}. |
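The Compositional CIFAR-10 construction quoted in the table (concatenate k CIFAR-10 images, predict all k component classes; training permutation collision probability at most 10^6 / 10^k) can be sketched as below. This is a minimal illustration, not the paper's code: the random stand-in arrays replace real CIFAR-10 images, and the function name and concatenation axis are assumptions.

```python
import numpy as np

def make_compositional_sample(images, labels, k, rng):
    """Concatenate k randomly chosen images side by side and
    return the composite image plus its k component labels,
    mirroring the Compositional CIFAR-10 task description."""
    idx = rng.integers(0, len(images), size=k)
    composite = np.concatenate([images[i] for i in idx], axis=1)  # (32, 32*k, 3)
    return composite, labels[idx]

# Stand-in data shaped like CIFAR-10 (the paper uses the real dataset).
rng = np.random.default_rng(0)
images = rng.random((100, 32, 32, 3)).astype(np.float32)
labels = rng.integers(0, 10, size=100)

x, y = make_compositional_sample(images, labels, k=4, rng=rng)
print(x.shape)  # (32, 128, 3)
print(y.shape)  # (4,)

# With 10^6 training samples and 10^k class permutations, the chance a
# test point's permutation appeared in training is at most 1e6 / 10**k.
k = 10
collision_bound = 1e6 / 10**k
print(collision_bound)  # 0.0001
```

For large k this bound vanishes, which is why the paper treats test-set permutations as effectively unseen during training.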
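The experiment-setup row describes the network family precisely enough to sketch: fully connected, ReLU on every layer except the final linear one, width in {8, 32, 128}, layer count in {3, 5, 7}. A minimal numpy forward pass under those settings follows; the initialization scheme, and the assumption that the layer count includes the output layer, are guesses not stated in the source.

```python
import numpy as np

def init_mlp(in_dim, width, depth, out_dim, rng):
    # depth counts all layers including the linear output layer (assumed).
    dims = [in_dim] + [width] * (depth - 1) + [out_dim]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    # ReLU on every layer except the final linear layer, per the paper.
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

def mse(pred, target):
    # Mean squared error, the training loss named in the setup.
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
params = init_mlp(in_dim=16, width=32, depth=5, out_dim=1, rng=rng)
x = rng.standard_normal((8, 16))
print(forward(params, x).shape)  # (8, 1)
```

The paper trains this family with Adam at learning rate 0.01 for 10000 iterations; an optimizer is omitted here to keep the sketch dependency-free.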