Gaussian mixture layers for neural networks

Authors: Sinho Chewi, Philippe Rigollet, Yuling Yan

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "As a proof of concept, we validate our proposal through experiments on simple classification tasks, where a GM layer achieves test performance comparable to that of a two-layer fully connected network. Furthermore, we examine the behavior of these dynamics and demonstrate numerically that GM layers exhibit markedly different behavior compared to classical fully connected layers, even when the latter are large enough to be considered in the mean-field regime. We conduct numerical experiments in Section 5 on the MNIST and Fashion-MNIST datasets."
Researcher Affiliation | Academia | Sinho Chewi (EMAIL), Department of Statistics and Data Science, Yale University; Philippe Rigollet (EMAIL), Department of Mathematics, Massachusetts Institute of Technology; Yuling Yan (EMAIL), Department of Statistics, University of Wisconsin-Madison.
Pseudocode | No | No explicit pseudocode or algorithm blocks are present. The paper contains mathematical derivations in Appendix A, but these are not formatted as pseudocode.
Open Source Code | Yes | "The source code for all experiments in this section is available at https://github.com/yulingy/GM_layer."
Open Datasets | Yes | "Dataset. We test the performance of neural networks with GM layers on multi-class classification on two widely used datasets: MNIST (LeCun & Cortes, 2010) and Fashion-MNIST (Xiao et al., 2017). Both datasets consist of 60,000 training examples and 10,000 test examples, where each example is a 28 x 28 grayscale image, associated with a label from one of 10 classes. ... Finally, we conduct an experiment based on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009)..."
Dataset Splits | Yes | "Both datasets consist of 60,000 training examples and 10,000 test examples, where each example is a 28 x 28 grayscale image, associated with a label from one of 10 classes."
Hardware Specification | Yes | "The numerical experiments conducted in this paper are implemented with PyTorch in Python, using a 2023 MacBook Pro with Apple M2 Pro chip and 32GB memory."
Software Dependencies | No | "The numerical experiments conducted in this paper are implemented with PyTorch in Python, using a 2023 MacBook Pro with Apple M2 Pro chip and 32GB memory. The fully-connected layers are implemented using PyTorch's built-in functions. The GM layers are implemented using the derivation in Appendix A (thanks to PyTorch's Automatic Differentiation engine, we only need to implement the loss function, and there is no need to implement the gradients explicitly)." Specific version numbers for Python or PyTorch are not provided.
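To illustrate the implementation strategy quoted above (write only the loss and let automatic differentiation supply the gradients), here is a minimal, stdlib-only sketch of a one-dimensional Gaussian-mixture log-density computed with the log-sum-exp trick. The function name and the per-component (mean, scale, log-weight) parameterization are illustrative assumptions, not the paper's exact GM layer, which uses parameters µβ, σ, U, and v.

```python
import math

def gm_log_density(x, means, sigmas, log_weights):
    """Log-density of a 1-D Gaussian mixture via the log-sum-exp trick.

    Hypothetical sketch: this toy version keeps only per-component
    means, scales, and log-weights, not the paper's full GM-layer
    parameterization.
    """
    # log p(x) = logsumexp_k [ log w_k + log N(x; mu_k, sigma_k^2) ]
    terms = [
        lw - 0.5 * math.log(2.0 * math.pi) - math.log(s)
        - (x - m) ** 2 / (2.0 * s ** 2)
        for m, s, lw in zip(means, sigmas, log_weights)
    ]
    m_max = max(terms)  # subtract the max for numerical stability
    return m_max + math.log(sum(math.exp(t - m_max) for t in terms))
```

Written with torch tensor operations instead of `math`, a loss of this shape is differentiable end to end, so no explicit gradient code is needed, consistent with the quoted implementation note.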
Experiment Setup | Yes | "Setup. The number of components K is a hyperparameter: larger K enables more expressive GM layers, while smaller K speeds up computation. We consider K ∈ {5, 10, 20} for different experiments. For a GM layer with parameters µβ, σ, U, and v, we initialize the entries of µβ, U and v i.i.d. from N(0, γ²), and the entries of σ all equal to γ, for some γ > 0. For most of the experiments we fix γ = 1/2, but by adjusting the value of γ we can investigate the role of the initialization scale, as discussed below. We train the network using SGD with batch size 64 and fixed learning rate 1 for the parameter σ and 0.1 for all other parameters. All of these choices are used for simplicity and were found via minimal tuning."
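As a sanity check of the quoted initialization scheme, the following stdlib-only sketch draws the entries of µβ, U, and v i.i.d. from N(0, γ²) and sets every entry of σ to γ, with γ = 1/2 as in the paper. The shapes (K components, d input dimensions) are placeholder guesses, since the quoted setup does not fix the layer dimensions.

```python
import random

def init_gm_layer(k, d, gamma=0.5, seed=0):
    """Initialize GM-layer parameters per the quoted setup.

    Shapes (k components, d input dims) are illustrative guesses;
    entries of mu_beta, U, v are i.i.d. N(0, gamma^2), and every
    entry of sigma is initialized to gamma.
    """
    rng = random.Random(seed)
    gauss = lambda: rng.gauss(0.0, gamma)  # one draw from N(0, gamma^2)
    mu_beta = [[gauss() for _ in range(d)] for _ in range(k)]
    U = [[gauss() for _ in range(d)] for _ in range(k)]
    v = [gauss() for _ in range(k)]
    sigma = [gamma] * k  # all entries equal to gamma
    return mu_beta, U, v, sigma
```

In PyTorch, the two learning rates (1 for σ, 0.1 for everything else) would naturally map to two parameter groups passed to `torch.optim.SGD`, e.g. `[{'params': [sigma], 'lr': 1.0}, {'params': other_params, 'lr': 0.1}]`.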