Expressivity of Neural Networks with Random Weights and Learned Biases
Authors: Ezekiel Williams, Alexandre Payeur, Avery Ryoo, Thomas Jiralerspong, Matthew Perich, Luca Mazzucato, Guillaume Lajoie
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical and numerical evidence demonstrating that feedforward neural networks with fixed random weights can approximate any continuous function on compact sets. We further show an analogous result for the approximation of dynamical systems with recurrent neural networks. Our findings are relevant to neuroscience, where they demonstrate the potential for behaviourally relevant changes in dynamics without modifying synaptic weights, as well as for AI, where they shed light on recent fine-tuning methods for large language models, like bias and prefix-based approaches. [...] We further provide empirical support for, and a deeper interrogation of, these results with numerical experiments exploring multi-task learning, motor-control, and dynamical system forecasting. [...] 3 NUMERICAL RESULTS |
| Researcher Affiliation | Academia | Ezekiel Williams Mathematics and Statistics Université de Montréal EMAIL Alexandre Payeur Mathematics and Statistics Université de Montréal EMAIL Avery Hee-Woon Ryoo Computer Science Université de Montréal EMAIL Thomas Jiralerspong Computer Science Université de Montréal EMAIL Matthew G Perich Neuroscience Université de Montréal EMAIL Luca Mazzucato Biology, Physics, and Mathematics University of Oregon EMAIL Guillaume Lajoie Mathematics and Statistics Université de Montréal EMAIL |
| Pseudocode | No | The paper does not contain any sections explicitly labeled as "Pseudocode" or "Algorithm", nor does it present structured steps in a code-like format. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology described is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We first validated the theory by checking whether a single-hidden-layer bias-learned FNN could perform classification on the Fashion MNIST dataset Deng (2012) increasingly well as its hidden layer was widened. [...] We compared bias learning against a fully-trained neural network with the same size and architecture (Fig. 1B). We found that the bias-only network achieved similar performance to the fully-trained network on most tasks (only significantly worse on KMNIST). An important difference here is that the networks had matched size and architecture, so that the number of trainable parameters in the bias-only network (3.2 × 10⁴ parameters) was several orders of magnitude smaller than in the fully-trained case (≈ 2.5 × 10⁷ parameters). Notably, a different set of biases was learned for each task. We conclude that bias-only learning in FNNs could be a viable avenue to perform multi-tasking with randomly initialized and fixed weights, but that it requires a much wider hidden layer than fully trained networks. Lastly, we note that the networks of Fig. 1B were trained with uniformly initialized weights, but that one can achieve similar, or even better performance with different weight initializations (see Fig. E.2C). Next, we investigated the task-specific responses of hidden units by estimating single-unit Task Variances (TV) Yang et al. (2019), defined as the variance of a hidden unit activation across the test set for each task. The TV provides a measure of the extent that a given hidden unit contributes to the given task: a unit with high TV in one task and low TV for all others is seen as selective for its high-TV task. We clustered the hidden-unit TVs using K-means clustering (K chosen by cross-validation) on the vectors of TVs for each unit and found that distinct clusters of units emerged (Fig. 1C). Some units reflected strong task selectivity (ex: cluster 3 for KMNIST and cluster 10 for Osmanya). Others responded to many, or all, tasks (ex: clusters 1 and 8), although a smaller fraction of clusters exhibited such non-selective activation patterns. Overall, we conclude that multi-task bias learning leads to the emergence of task-specific organization. We note, however, that task selectivity does not necessarily mean task utility: for example, a neuron could have a high variance for a single task but that variance could be picking up on noise and thus not functionally useful. We leave a deeper investigation of the functional significance of task-selectivity to future work. Finally, we explored the relationship between the bias of a hidden unit and its TV. If the neural networks are using biases to shut off units, analogous to the intuition in our theory (Section 2.1), then the units that do not actively participate in a task should be quiet due to a low bias value learned during training on that particular task. In other words, this intuition would suggest that units should exhibit a correlation between bias and TV, especially in task-specific clusters. In our experiments, all clusters did exhibit the statistical trend of a positive correlation between bias and TV, although to a varying degree across clusters (see numbers at the bottom of Fig. 1C). |
| Dataset Splits | Yes | For figure 1A, all networks were trained on Fashion MNIST, with 5 × 10⁴ training samples and 10⁴ test samples, using ADAM with a learning rate of 0.01. Training was run for 20 epochs with a batch size of 512. [...] Both bias and mask networks were trained on MNIST, with 5 × 10⁴ training samples and 10⁴ test samples, using ADAM with a learning rate of 0.01. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like ADAM, Pytorch, and Scikit-Learn, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | For figure 1A, all networks were trained on Fashion MNIST, with 5 × 10⁴ training samples and 10⁴ test samples, using ADAM with a learning rate of 0.01. Training was run for 20 epochs with a batch size of 512. Xavier uniform initialization with a gain of 1.0 was used for the fully trained networks, while the frozen weights for the bias-only networks were sampled from either a uniform distribution on [−0.1, 0.1] or from a zero-mean Gaussian with standard deviation 1/d, where d is the input dimension. [...] Both bias and mask networks were trained on MNIST, with 5 × 10⁴ training samples and 10⁴ test samples, using ADAM with a learning rate of 0.01. Trained parameters were each initialized uniformly on [−0.01, 0.01] (for the mask learned networks the bias vector was initialized in this way and left untrained), and training was run for 30 epochs with a batch size of 512. [...] Learning rates were 0.1 for bias learning and 0.001 for the fully-trained network; other parameters for the Adam optimizer were left at their default values in Pytorch. [...] The training optimizer was Adam with a learning rate of 0.001 and a weight decay value of 0.1. [...] Hyperparameters γv = 0.2 and γf = 0.04 controlled the relative weight of the velocity and force costs with respect to the position loss. The learning rate of the standard Adam optimizer was set to 3 × 10⁻³ and training was stopped when a loss of 5 × 10⁻³ was reached. |
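The bias-only scheme quoted in the excerpts above can be sketched as follows. This is a hypothetical NumPy toy, not the authors' code: the network sizes, the regression target, and the learning rate are illustrative assumptions; only the core idea (weights frozen at random values, gradient updates applied to biases alone) comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer ReLU network whose weights W, V stay frozen at their
# random initial values; only the biases b, c receive gradient updates.
d_in, d_hid, d_out = 2, 64, 1
W = rng.uniform(-0.1, 0.1, (d_hid, d_in))    # frozen input weights
V = rng.uniform(-0.1, 0.1, (d_out, d_hid))   # frozen readout weights
b = np.zeros(d_hid)                          # trainable hidden biases
c = np.zeros(d_out)                          # trainable output bias

def forward(X):
    H = np.maximum(W @ X.T + b[:, None], 0.0)   # ReLU hidden activations
    return (V @ H + c[:, None]).T, H

# Toy regression target on [-1, 1]^2.
X = rng.uniform(-1, 1, (256, d_in))
y = (X[:, 0] * X[:, 1])[:, None]

def mse():
    return float(((forward(X)[0] - y) ** 2).mean())

mse_before = mse()
lr = 0.05
for _ in range(2000):
    pred, H = forward(X)
    err = (pred - y) / len(X)   # d(loss)/d(pred), up to a constant factor
    # Bias gradients only; W and V are never updated.
    c -= lr * err.sum(axis=0)
    b -= lr * ((V.T @ err.T) * (H > 0)).sum(axis=1)
mse_after = mse()
```

With frozen weights the only degrees of freedom are which units are active and how strongly they are offset, which is exactly the "shut-off" intuition the paper attributes to its theory (Section 2.1).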
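The Task Variance analysis quoted in the Open Datasets row can be illustrated with a toy computation. The synthetic activations, the unit/task counts, and the small k-means routine below are assumptions for illustration (the paper uses real hidden-unit activations and, per the Software row, Scikit-Learn); only the TV definition, variance of a unit's activation over a task's test set, is from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic hidden activations for 3 tasks: each task strongly drives a
# disjoint trio of the 9 units, mimicking the task selectivity of Fig. 1C.
n_units, n_tasks, n_samples = 9, 3, 200
acts = []
for t in range(n_tasks):
    A = 0.05 * rng.standard_normal((n_samples, n_units))   # background noise
    A[:, 3 * t:3 * t + 3] += rng.standard_normal((n_samples, 3))
    acts.append(A)

# Task Variance (TV): variance of each unit's activation over a task's inputs.
TV = np.stack([A.var(axis=0) for A in acts], axis=1)       # (n_units, n_tasks)

# Normalize per unit so clustering reflects selectivity, not overall scale.
TVn = TV / TV.max(axis=1, keepdims=True)

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm, standing in for scikit-learn's KMeans."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

clusters = kmeans(TVn, k=3)
preferred = TV.argmax(axis=1)   # each unit's most-activating task
```

A unit whose TV vector has one dominant entry (here, `preferred`) is task-selective in the paper's sense; clustering the normalized TV vectors groups units by their selectivity profile.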