Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation
Authors: Leander Weber, Jim Berend, Moritz Weckbecker, Alexander Binder, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish the convergence of LFP theoretically and empirically, demonstrating its effectiveness on various models and datasets. Via two applications, neural network pruning and the approximation-free training of Spiking Neural Networks (SNNs), we demonstrate that LFP combines increased efficiency in terms of computation and representation with flexibility w.r.t. the choice of model architecture and objective function. In this section, we demonstrate theoretically and experimentally, in the context of supervised classification, that LFP converges and can successfully train ML models. First, we show the following theorem: Theorem 1. For a differentiable loss function L and any ReLU-activated network, LFP-0 with initial reward ... is equivalent to weight-scaled gradient descent... For the detailed proof, refer to Appendix A.2. In a first experiment, we show empirically that LFP is able to train models in simple supervised classification settings. For this purpose, we train a small ReLU-activated MLP (fulfilling the proof conditions of Theorem 1 except for the LFP-0 rule) on three toy datasets (cf. Figure 3, left) using LFP-ε (Equation 8). For more details about the specific setup, refer to Appendix A.7.1. Figure 3 shows the decision boundaries, accuracies, and weight distributions of the resulting models. |
| Researcher Affiliation | Academia | 1 Fraunhofer Heinrich Hertz Institute, Berlin, Germany; 2 Otto-von-Guericke University, Magdeburg, Germany; 3 Singapore Institute of Technology, Singapore, Singapore; 4 Technische Universität Berlin, Berlin, Germany; 5 BIFOLD Berlin Institute for the Foundations of Learning and Data, Berlin, Germany; 6 Centre of eXplainable Artificial Intelligence, Technological University Dublin, Dublin, Ireland; corresp.: EMAIL |
| Pseudocode | No | The paper describes the Layer-wise Feedback Propagation (LFP) method in detail using prose and mathematical equations (e.g., Section 3.1.1 Reward Propagation and Parameter Update). Figure 1 visually summarizes the approach with steps A, B, and C, but it is a diagrammatic representation, not pseudocode or an algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/leanderweber/layerwise-feedback-propagation. |
| Open Datasets | Yes | We establish the convergence of LFP theoretically and empirically, demonstrating its effectiveness on various models and datasets. In a first experiment, we show empirically that LFP is able to train models in simple supervised classification settings. Here, we train a small ReLU-activated MLP (fulfilling the proof conditions of Theorem 1 except for the LFP-0 rule) on three toy datasets (cf. Figure 3, left) using LFP-ε (Equation 8). We further used the VGG-16 (Simonyan & Zisserman, 2014) and ResNet-18 (He et al., 2016) models, initialized with ImageNet weights available in torchvision (Paszke et al., 2019). Of the above models, we trained the VGG-like model on the CIFAR10 and CIFAR100 tasks (Krizhevsky, 2009) for 50 epochs. The VGG-16 and ResNet-18 models were trained on the Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011), ISIC 2019 (Tschandl et al., 2018; Codella et al., 2018; Combalia et al., 2019) skin lesion, and Food-11 (https://www.kaggle.com/datasets/vermaavi/food11) classification datasets. To demonstrate the effectiveness of LFP, we trained two types of SNN architectures on the MNIST handwritten digit classification task. We provide an initial experiment training a small MLP regressor for 100 epochs on the California Housing Dataset with LFP. We fine-tune it on a small bean disease classification dataset (https://huggingface.co/datasets/AI-Lab-Makerere/beans) for 100 epochs, altering all parameters except for the embeddings. |
| Dataset Splits | Yes | Blob Dataset: 1000 training samples and 100 test samples were drawn using the scikit-learn function skdata.make_blobs with two classes, centered at [1, 1] and [2, 2], respectively, and a cluster standard deviation of 0.2. Models were trained on this dataset for 10 epochs. Circle Dataset: 10000 training samples and 500 test samples were drawn using the scikit-learn function skdata.make_circles with two classes, using a cluster standard deviation of 0.2 and a scale factor of 0.05. Models were trained on this dataset for 10 epochs. Swirl Dataset: 10000 training samples and 500 test samples were drawn for three classes in the shape of a swirl, similar to the toy dataset used in Yeom et al. (2021). Models were trained on this dataset for 15 epochs. |
| Hardware Specification | Yes | Of the experiments in this work, the ones in Sections 3.3, 3.4, and 3.6, as well as Appendix A.1, A.3, A.4, A.8, and A.9 ran on a local machine, while all other experiments ran on an HPC Cluster. The local machine used Ubuntu 20.04.6 LTS, an NVIDIA TITAN RTX Graphics Card with 24GB of memory, an Intel Xeon CPU E5-2687W V4 with 3.00GHz, and 32GB of RAM. The HPC-Cluster used Ubuntu 18.04.6 LTS, an NVIDIA A100 Graphics Card with 40GB of memory, an Intel Xeon Gold 6150 CPU with 2.70GHz, and 512GB of RAM. |
| Software Dependencies | No | The code for all experiments was implemented in Python, using PyTorch (Paszke et al., 2019) for deep learning, including weights available from torchvision, as well as snnTorch (Eshraghian et al., 2021) for SNN applications. matplotlib was used for plotting. The implementation of LFP builds upon zennit (Anders et al., 2021) and LXT (Achtibat et al., 2024), two XAI libraries. |
| Experiment Setup | Yes | Small ReLU-activated MLPs were trained on three toy datasets with LFP-ε and reward #3 from Table 2 (as well as stochastic gradient descent with categorical cross-entropy loss) using a batch size of 128 and momentum of 0.95. The MLPs each consist of three dense layers with 32, 16, and n neurons, respectively, and are visualized to the right of Figure 3. n refers to the number of classes, which varies between 2 and 3 depending on the dataset. The models were trained for a large number of different learning rates, chosen according to the following formula: a·10^b, with a ∈ {1, 2, ..., 10} and b ∈ {−10, −9, ..., 3}. We trained this model for one epoch using a learning rate of 0.5, a batch size of 8, and no momentum. We utilize a batch size of 128, learning rates of 0.1 and 0.01 for the MLP and LeNet, respectively, a momentum of 0.9, an SGD optimizer, and reward #3 from Table 2. We utilize a batch size of 128, learning rates of 0.1 and 0.01 for the MLP and LeNet, respectively, a momentum of 0.9, an SGD optimizer, and a categorical cross-entropy loss. We utilize a batch size of 1000, cognitive and social coefficients of 2, an inertial weight of 0.8, 1000 particles, a search space of [−0.1, 0.1], and categorical cross-entropy loss as a measure of fitness. All models were trained multiple times, with ReLU, SiLU (Elfwing et al., 2018), ELU (Clevert et al., 2016), Tanh, Sigmoid, and Heaviside step functions for hidden-layer activations. Results are averaged over three randomly generated seeds. We further used the same training setups as in Appendix A.7.3 (Non-ReLU Activations), except for slightly different learning rates: on both CIFAR10 and CIFAR100, we trained with a max learning rate of 0.1. For VGG-16 and VGG-16-BN, we used a max learning rate of 1e-3, and for ResNet-18 and ResNet-32, a max learning rate of 5e-3. For the LIF activation function, we adapted the implementation from Eshraghian et al. (2021). To enable gradient-based training, we employed a surrogate function s: ℝ → ℝ in the backward pass, defined as s(x) = x / (1 + 25·\|x\|). We furthermore used a fixed LIF threshold of θ = 1 and varied the decay factor β across values of 0.3, 0.6, and 0.9. The sequence length L was varied across 5, 15, and 25 time steps. Both networks were trained for three epochs using a batch size of 128. For optimization, we used the stochastic gradient descent algorithm. Learning rates of 10⁻³, 10⁻², 5×10⁻², 7.5×10⁻², 0.1, 0.25, 0.5, and 0.8 were investigated in combination with the one-cycle-lr scheduler proposed in Smith & Topin (2019). For stability, in this experiment, the backward pass was normalized by the maximum absolute value between layers. Additionally, we utilized the adaptive gradient clipping strategy proposed in Brock et al. (2021). The experiments were performed across three different random seeds, with results averaged. We investigated learning rates of 1e-3, 5e-4, and 1e-4 for gradient descent and LFP; the setting in which the maximum accuracy was reached (lr = 5e-4) is reported in the figure. For R-STDP, we used the hyperparameters of the reference implementation, only changing the number of time steps to 25. We trained the LFP and gradient descent models for 50 epochs each. |
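The Blob and Circle splits quoted in the Dataset Splits row can be sketched directly with scikit-learn. This is a minimal reconstruction: sample counts, class centers, and noise values follow the row above, while the `random_state` seeds are illustrative assumptions, and the paper's "cluster standard deviation" and "scale factor" for the circle data are mapped to scikit-learn's `noise` and `factor` parameters.

```python
from sklearn.datasets import make_blobs, make_circles

# Blob dataset: two classes centered at [1, 1] and [2, 2], cluster std 0.2;
# 1000 training and 100 test samples (seeds are assumptions, not from the paper)
X_blob_train, y_blob_train = make_blobs(
    n_samples=1000, centers=[[1, 1], [2, 2]], cluster_std=0.2, random_state=0
)
X_blob_test, y_blob_test = make_blobs(
    n_samples=100, centers=[[1, 1], [2, 2]], cluster_std=0.2, random_state=1
)

# Circle dataset: the quoted "cluster standard deviation of 0.2" maps to `noise`,
# the "scale factor of 0.05" to `factor` (ratio of inner to outer circle radius)
X_circ_train, y_circ_train = make_circles(
    n_samples=10000, noise=0.2, factor=0.05, random_state=0
)
X_circ_test, y_circ_test = make_circles(
    n_samples=500, noise=0.2, factor=0.05, random_state=1
)
```

The Swirl dataset is not a scikit-learn built-in and is omitted here; per the row above, it follows the toy data of Yeom et al. (2021).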
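The learning-rate sweep described in the Experiment Setup row (a·10^b with a ∈ {1, ..., 10} and b ∈ {−10, ..., 3}) is a simple product grid; a sketch of how such a candidate list could be enumerated:

```python
# Learning-rate grid a * 10**b, a in 1..10, b in -10..3, as quoted above.
# Note the grid overlaps at decade boundaries: 10 * 10**b equals 1 * 10**(b+1),
# so the 140 entries contain some duplicate values.
learning_rates = sorted(a * 10.0 ** b for b in range(-10, 4) for a in range(1, 11))

print(len(learning_rates))  # 10 values of a x 14 values of b = 140 candidates
```

The smallest candidate is 1e-10 and the largest is 10·10³ = 1e4.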
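The surrogate s(x) = x / (1 + 25·|x|) quoted for the SNN backward pass can be written down directly. A minimal sketch in NumPy, with the slope constant 25 taken from the setup above; the function name and demo values are illustrative:

```python
import numpy as np

def surrogate(x, slope=25.0):
    """Fast-sigmoid-style surrogate s(x) = x / (1 + slope * |x|), used in the
    backward pass in place of the non-differentiable Heaviside spike function."""
    return x / (1.0 + slope * np.abs(x))

# The surrogate is odd, monotone, and saturates toward +/- 1/25 for large |x|
print(surrogate(np.array([-1e6, -1.0, 0.0, 1.0, 1e6])))
```

Bounding the output this way keeps the backward signal finite regardless of how far the membrane potential sits from the threshold θ = 1.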