Gaussian Pre-Activations in Neural Networks: Myth or Reality?
Authors: Pierre Wolinski, Julyan Arbel
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network depth, even in narrow neural networks, under the assumption that the pre-activations are independent. In the process, we discover a set of constraints that a neural network should satisfy to ensure Gaussian pre-activations. In addition, we provide a critical review of the claims of the Edge of Chaos line of work and construct a non-asymptotic Edge of Chaos analysis. We also propose a unified view on the propagation of pre-activations, encompassing the framework of several well-known initialization procedures. More generally, our work provides a principled framework for addressing the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are guaranteed to be Gaussian? Our code is available on GitHub: https://github.com/p-wol/gaussian-preact/. ... First, we experimentally test their Gaussianity, i.e., whether the Gaussian hypothesis holds or not. We show that it does not hold in many cases. Second, we construct families of activation functions and initialization distributions to make this Gaussian hypothesis hold, and we show their construction process. ... we experimentally demonstrate that the Gaussian hypothesis is mostly invalid in multilayer perceptrons with finite width (Section 3.1 and Section 3.2); ... we empirically demonstrate that, with our activation functions (ϕ^p_θ)_θ, the distribution of the pre-activations remains Gaussian during propagation, while it drifts away from the standard Gaussian when using tanh and ReLU (Sections 5.1 and 5.2); ... Finally, we propose in Section 5 a series of simulations in order to check whether our propositions meet the requirement of maintaining Gaussian pre-activations across neural networks. We also show the performance of trained neural networks in different setups, including standard ones and the one we are proposing. |
| Researcher Affiliation | Academia | Pierre Wolinski EMAIL LAMSADE, Paris-Dauphine University, PSL University, CNRS, 75016 Paris, France; Julyan Arbel EMAIL Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France |
| Pseudocode | No | The paper describes methodologies using mathematical equations and text, but it does not contain any explicitly labeled pseudocode or algorithm blocks. Procedures are described in narrative form or via mathematical derivations. |
| Open Source Code | Yes | Our code is available on GitHub: https://github.com/p-wol/gaussian-preact/. |
| Open Datasets | Yes | Propagation of the correlations. First, we have sampled randomly 10 data points in each of the 10 classes of the CIFAR-10 dataset, that is 100 in total for each dataset. Then, for each tested neural network (NN) architecture, we repeated n_init = 1000 times the following operation: (i) sample the parameters according to the EOC; (ii) propagate the 100 data points in the NN. Thereafter, for each pair (x_a, x_b) of the selected 100 data points, we have computed the empirical correlation c^l_{ab} between the obtained pre-activations, averaged over the n_init samples. Finally, we have averaged the results over the classes: the matrix C^l_{pq} plotted in Figure 1 shows the mean of the correlation c^l_{ab} for data points x_a and x_b belonging respectively to classes p and q in {0, …, 9}. Only the experiments with CIFAR-10 are reported in Figure 1; the results on MNIST, which are similar, are reported in Figure 14 in Appendix H.1. |
| Dataset Splits | Yes | Training, validation, and test sets. For MNIST and CIFAR-10, we split randomly the initial training set into two sets: the training set, which will be actually used to train the neural network, and the validation set, which will be used to stop training when the network begins to overfit. The sizes of the different sets are as follows: MNIST: 50000 training samples; 10000 validation samples; 10000 test samples; CIFAR-10: 42000 training samples, 8000 validation samples; 10000 test samples. The training sets are split into mini-batches with 200 samples each. No data augmentation is performed. |
| Hardware Specification | No | This work was granted access to the HPC resources of IDRIS under the allocation 2024-AD011013762R2 made by GENCI. |
| Software Dependencies | No | The paper mentions "PyTorch" in Section F and "Adam optimizer" in Section F and H.5, but does not provide specific version numbers for these or any other software components. |
| Experiment Setup | Yes | In the following, we train all the neural networks with the same optimizer, Adam, and the same learning rate η = 0.001. We use a scheduler and an early stopping mechanism, respectively based on the training loss and the validation loss, the test loss not being used during training. We did not use data augmentation. All the technical details are provided in Appendix H.5. ... Optimizer. We use the Adam optimizer (Kingma & Ba, 2015) with the parameters: learning rate = 0.001; β1 = 0.9; β2 = 0.999; weight decay = 0. ... Learning rate scheduler. We use a learning rate scheduler based on the reduction of the training loss. If the training loss does not decrease at least by a factor of 0.01 for 10 epochs, then the learning rate is multiplied by a factor of 1/∛10. ... Early stopping. We add an early stopping rule based on the reduction of the validation loss. If the validation loss does not decrease at least by a factor of 0.001 for 30 epochs, then we stop training. ... The training sets are split into mini-batches with 200 samples each. |
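The correlation-propagation experiment quoted under Open Datasets can be sketched in a few lines. This is a simplified stand-in, not the authors' code: it uses plain Gaussian weight initialization with tanh in place of the paper's EOC-initialized networks, synthetic points in place of CIFAR-10 samples, and `avg_preact_correlation` is a name of our choosing.

```python
import numpy as np

def avg_preact_correlation(X, width=50, depth=5, n_init=100, sigma_w=1.0, seed=0):
    """Average, over n_init random initializations, of the empirical
    correlation matrix of the last layer's pre-activations for the
    points in X (one row per data point).  Simplified: Gaussian
    weights + tanh rather than the paper's EOC initialization."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    acc = np.zeros((n, n))
    for _ in range(n_init):
        h = X
        for _ in range(depth):
            fan_in = h.shape[1]
            W = rng.normal(0.0, sigma_w / np.sqrt(fan_in), size=(fan_in, width))
            z = h @ W          # pre-activations of this layer
            h = np.tanh(z)     # activations fed to the next layer
        acc += np.corrcoef(z)  # correlation across units, per pair of points
    return acc / n_init

# 10 synthetic "data points" in dimension 20, standing in for CIFAR-10 samples
X = np.random.default_rng(1).normal(size=(10, 20))
C = avg_preact_correlation(X, n_init=20)
```

Averaging the per-class blocks of `C` would then give the class-level matrix C^l_{pq} the paper plots in Figure 1.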
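The train/validation splits quoted under Dataset Splits amount to a shuffled partition of the original training indices; a minimal sketch (the function name `split_train_val` is ours, not the paper's):

```python
import random

def split_train_val(n_total, n_val, seed=0):
    """Randomly partition the indices 0..n_total-1 of the original
    training set into train and validation index lists."""
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    return idx[n_val:], idx[:n_val]

# MNIST: 60000 original training samples -> 50000 train / 10000 validation
mnist_train, mnist_val = split_train_val(60000, 10000)
# CIFAR-10: 50000 original training samples -> 42000 train / 8000 validation
cifar_train, cifar_val = split_train_val(50000, 8000)
```

With PyTorch one would typically get the same effect via `torch.utils.data.random_split`, then wrap the subsets in `DataLoader`s with `batch_size=200` to match the quoted mini-batch size.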
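The scheduler and early-stopping rules quoted under Experiment Setup share the same plateau logic. Below is a sketch assuming "does not decrease at least by a factor of t" means a relative improvement below t over the best loss seen so far; the `Plateau` class and `end_of_epoch` helper are hypothetical names, not the paper's code.

```python
class Plateau:
    """Fires (returns True) after `patience` consecutive epochs without a
    relative improvement of at least `threshold` over the best loss seen."""
    def __init__(self, threshold, patience):
        self.threshold, self.patience = threshold, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, loss):
        if loss < self.best * (1.0 - self.threshold):
            self.best, self.bad_epochs = loss, 0  # improvement: reset counter
            return False
        self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            self.bad_epochs = 0
            return True
        return False

lr = 1e-3
lr_sched = Plateau(threshold=0.01, patience=10)   # scheduler rule (train loss)
stopper  = Plateau(threshold=0.001, patience=30)  # early stopping (val loss)

def end_of_epoch(train_loss, val_loss):
    global lr
    if lr_sched.step(train_loss):
        lr *= 10 ** (-1.0 / 3.0)   # multiply the learning rate by 1/cbrt(10)
    return stopper.step(val_loss)  # True -> stop training
```

The same rules map onto PyTorch's `ReduceLROnPlateau` with `threshold=0.01`, `patience=10`, and `factor=10 ** (-1/3)` for the scheduler half.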