Optimization and Generalization Guarantees for Weight Normalization

Authors: Pedro Cisneros-Velarde, Zhijie Chen, Sanmi Koyejo, Arindam Banerjee

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we present experimental results which illustrate how the normalization terms and other quantities of theoretical interest relate to the training of Weight Norm networks."
Researcher Affiliation | Collaboration | Pedro Cisneros-Velarde (VMware Research), Zhijie Chen (University of Illinois Urbana-Champaign), Sanmi Koyejo (Stanford University), Arindam Banerjee (University of Illinois Urbana-Champaign)
Pseudocode | No | The paper provides mathematical derivations and theoretical analyses but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions PyTorch 2.0 as a library with a built-in implementation of Weight Norm (Section 1) and links to its documentation, but it does not state that the authors release their own code for the methodology described in the paper.
Open Datasets | Yes | "We do empirical evaluations on CIFAR-10 (Krizhevsky, 2009) and MNIST (Deng, 2012)."
Dataset Splits | No | The paper mentions evaluating on CIFAR-10 and MNIST with mini-batch SGD, but it does not specify the training, validation, or test splits used for these datasets.
Hardware Specification | Yes | "Our experiments were conducted on a computing cluster with AMD EPYC 7713 64-Core Processor and NVIDIA A100 Tensor Core GPU."
Software Dependencies | Yes | "Pytorch 2.0. Pytorch 2.0 documentation. https://pytorch.org/docs/stable/generated/torch.nn.utils.weight_norm.html. Accessed: 05-09-2023."
Experiment Setup | Yes | "We apply mini-batch stochastic gradient descent (SGD) with batch size 512 to optimize the Weight Norm networks under mean squared loss. ... with learning rate 0.001, and weights initialized independently from a uniform distribution [−0.5/√m, 0.5/√m]. ... for two different widths m ∈ {512, 1024} on the MNIST dataset, ... the weights are initialized with a uniform distribution [−5/√m, 5/√m]."
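The "Experiment Setup" row above describes mini-batch SGD on Weight Norm networks under mean squared loss, with a width-scaled uniform initialization. The following is a minimal sketch of that setup: a one-hidden-layer network whose hidden weights use the Weight Norm reparameterization w_j = g_j · v_j / ‖v_j‖, trained by SGD on an MSE loss. The architecture, the fixed output layer, the synthetic data, and the reduced sizes (m = 128, d = 64, batch 128 instead of the paper's m ∈ {512, 1024}, batch 512 on MNIST) are assumptions made to keep the demo small, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, batch, lr = 128, 64, 128, 0.001  # reduced sizes for a quick demo (assumption)

# Weight Norm reparameterization: effective weight w_j = g_j * v_j / ||v_j||.
V = rng.uniform(-0.5 / np.sqrt(m), 0.5 / np.sqrt(m), size=(m, d))
g = np.linalg.norm(V, axis=1)                    # common choice: g_j = ||v_j|| at init
a = rng.uniform(-1.0, 1.0, size=m) / np.sqrt(m)  # output layer, kept fixed (assumption)

def sgd_step(X, y):
    """One SGD step on the MSE loss, updating the Weight Norm parameters (V, g)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)   # (m, 1)
    W = (g[:, None] / norms) * V                       # effective weights, ||W_j|| = g_j
    Z = X @ W.T
    H = np.maximum(0.0, Z)                             # ReLU hidden features
    err = H @ a - y
    # Backpropagate through w = g * v / ||v||:
    #   dL/dg = <dL/dw, v/||v||>,  dL/dv = (g/||v||) (dL/dw - (dL/dg) v/||v||).
    dH = np.outer(err, a) * (Z > 0)
    dW = dH.T @ X / len(y)
    unit = V / norms
    dg = np.sum(dW * unit, axis=1)
    dV = (g[:, None] / norms) * (dW - dg[:, None] * unit)
    g[:] -= lr * dg
    V[:] -= lr * dV
    return float(np.mean(err ** 2))

# Synthetic stand-in for a regression mini-batch (assumption, not MNIST).
X = rng.normal(size=(batch, d))
y = rng.normal(size=batch)

losses = [sgd_step(X, y) for _ in range(300)]
```

With the small learning rate quoted in the paper (0.001), the loss decreases slowly but steadily; the reparameterization guarantees that each effective hidden weight keeps norm |g_j| regardless of how V moves.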