Breaking Neural Network Scaling Laws with Modularity

Authors: Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete

ICLR 2025

Reproducibility Variable: Result (LLM response follows each entry)
Research Type: Experimental
We empirically validate our theoretical model on a novel parametrically controllable sine wave regression task and show that sample complexity varies exponentially with task dimension. We empirically validate the improved generalizability (both in- and out-of-distribution) of our modular learning approach on parametrically controllable, high-dimensional tasks: sine-wave regression and Compositional CIFAR-10.
Researcher Affiliation: Academia
Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete. Massachusetts Institute of Technology. EMAIL
Pseudocode: Yes
Alg 1 shows the full procedure to find a single module projection Û_i; each step of the algorithm simply applies gradient descent on Eqn 17 with respect to Û_i. We repeat this procedure K times with different random initializations to find the initial values of all K module projections in our architecture. Then, we train all module parameters (including the Û_i) via gradient descent on the task loss.
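The quoted initialization procedure can be sketched as follows. This is a hedged reconstruction, not the authors' code: the paper's Eqn 17 objective is not reproduced in this report, so `grad` below uses a toy quadratic surrogate loss, and all names (`init_module_projections`, the identity target) are hypothetical.

```python
import numpy as np

def init_module_projections(K, d, steps=100, lr=0.1, seed=0):
    """Find K module projections U_hat via gradient descent, each from a
    different random initialization (stand-in for the paper's Alg 1)."""
    rng = np.random.default_rng(seed)
    target = np.eye(d)  # hypothetical optimum of the toy surrogate loss

    def grad(U):
        # Gradient of the toy loss ||U - target||^2 / 2.
        # In the paper, the gradient of Eqn 17 w.r.t. U_hat_i would go here.
        return U - target

    projections = []
    for _ in range(K):
        U = rng.standard_normal((d, d))  # fresh random initialization
        for _ in range(steps):
            U -= lr * grad(U)            # one gradient step per iteration
        projections.append(U)
    return projections

# After this initialization, all module parameters (including the U_hat_i)
# would be trained jointly by gradient descent on the task loss.
projs = init_module_projections(K=4, d=3)
```

The K independent random restarts mirror the quoted description; only the inner objective differs from the paper.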
Open Source Code: No
The paper does not provide concrete access to source code for the methodology described. It does not contain a repository link, an explicit code release statement, or code in supplementary materials.
Open Datasets: Yes
We conduct experiments on a Compositional CIFAR-10 dataset inspired by the Compositional MNIST dataset of Jarvis et al. (2023). In the task, combinations of k CIFAR-10 images are concatenated together and the model is asked to predict the class of all component images simultaneously.
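The compositional construction described above can be sketched like this. It is an illustrative assumption, not the paper's released code: random arrays stand in for CIFAR-10 images, and the side-by-side concatenation axis is a guess (the quote only says images are "concatenated together").

```python
import numpy as np

def make_compositional_batch(images, labels, k, n_samples, rng):
    """Build a Compositional CIFAR-10-style batch: each sample concatenates
    k randomly chosen images, and the target is the tuple of all k labels."""
    xs, ys = [], []
    for _ in range(n_samples):
        idx = rng.integers(0, len(images), size=k)
        xs.append(np.concatenate(images[idx], axis=1))  # side-by-side concat (assumed axis)
        ys.append(labels[idx])                          # predict every component's class
    return np.stack(xs), np.stack(ys)

rng = np.random.default_rng(0)
fake_images = rng.random((100, 32, 32, 3))   # stand-in for CIFAR-10 images
fake_labels = rng.integers(0, 10, size=100)  # stand-in for CIFAR-10 labels
x, y = make_compositional_batch(fake_images, fake_labels, k=3, n_samples=8, rng=rng)
# x has one 32 x (32*k) x 3 composite per sample; y holds k labels per sample
```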
Dataset Splits: Yes
A training dataset is generated by first drawing n training samples x from a mean-zero Gaussian: x ~ N(0, I). Then, for each x, the regression target y(x) is computed. The test dataset is constructed analogously. ... We use a fixed training set size of 10^6; thus, the probability of a test set point having the same class permutation as a training set point is at most 10^6 / 10^k. For large k, we expect each test set point to test a class permutation unobserved in the training set. We use a test set of size 10000.
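The split-generation recipe quoted above is simple enough to sketch directly; the `target_fn` here is a placeholder (the paper's actual sine-wave regression target is parameterized differently), and only the Gaussian sampling and the quoted collision bound come from the source.

```python
import numpy as np

def make_split(n, d, target_fn, rng):
    """Draw n samples x ~ N(0, I_d) and compute regression targets y(x);
    the test split is generated the same way with fresh draws."""
    x = rng.standard_normal((n, d))
    y = target_fn(x)
    return x, y

# Placeholder target; the paper's sine-wave regression function is not this.
rng = np.random.default_rng(0)
toy_target = lambda x: np.sin(x).sum(axis=1)
x_train, y_train = make_split(1000, 4, toy_target, rng)
x_test, y_test = make_split(100, 4, toy_target, rng)

# Collision bound quoted above: with 10**6 training points and 10**k equally
# likely class permutations, the chance that a test point repeats a training
# permutation is at most 10**6 / 10**k.
def collision_bound(k):
    return 10**6 / 10**k
```

For k = 8, for example, the bound is 0.01, consistent with the claim that large k makes repeated permutations unlikely.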
Hardware Specification: No
Experiments are run on a computing cluster with GPUs ranging in memory size from 11 GB to 80 GB.
Software Dependencies: No
Networks are trained using Adam (Kingma & Ba, 2015) to minimize a mean squared error loss. ... In our experiments, all neural networks are fully connected and use ReLU activations except at the final layer.
Experiment Setup: Yes
Architecture and hyperparameter settings (sine wave regression): In our experiments, all neural networks are fully connected and use ReLU activations except at the final layer. We do not use additional operations in the network such as batch normalization. Networks are trained using Adam (Kingma & Ba, 2015) to minimize a mean squared error loss. We perform a sweep over learning rates in {0.001, 0.01, 0.1} and find that the learning rate of 0.01 performs best in general over all experiments... All networks are trained for 10000 iterations... The network architectures are varied as follows: the width of the hidden layers is selected from {8, 32, 128}, and the number of layers is selected from {3, 5, 7}.
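The quoted sweep and architecture family can be sketched as follows. This is a minimal numpy reconstruction under assumptions: "number of layers" is interpreted as the number of weight matrices, the initialization scheme is a generic 1/sqrt(fan-in) choice, and the Adam training loop itself is omitted; only the grid values and the ReLU-except-final-layer structure come from the quote.

```python
import itertools
import numpy as np

# Hyperparameter grid quoted above; 0.01 was reported best overall.
learning_rates = [0.001, 0.01, 0.1]
widths = [8, 32, 128]
depths = [3, 5, 7]
configs = list(itertools.product(learning_rates, widths, depths))

def init_weights(d_in, width, depth, d_out, rng):
    """One interpretation of 'depth layers': depth weight matrices,
    with 1/sqrt(fan-in) scaling (an assumed initialization)."""
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    return [rng.standard_normal((a, b)) / np.sqrt(a)
            for a, b in zip(dims[:-1], dims[1:])]

def mlp_forward(x, weights):
    """Fully connected net, ReLU activations except at the final layer,
    no batch normalization, matching the quoted description."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)  # ReLU hidden layers
    return h @ weights[-1]          # linear output layer

rng = np.random.default_rng(0)
w = init_weights(d_in=4, width=8, depth=3, d_out=1, rng=rng)
out = mlp_forward(rng.standard_normal((5, 4)), w)  # one prediction per input row
```

Each of the 3 x 3 x 3 = 27 configurations would then be trained for 10000 Adam iterations on the MSE loss, as the quote describes.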