Optimization and Generalization Guarantees for Weight Normalization
Authors: Pedro Cisneros-Velarde, Zhijie Chen, Sanmi Koyejo, Arindam Banerjee
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we present experimental results which illustrate how the normalization terms and other quantities of theoretical interest relate to the training of Weight Norm networks. |
| Researcher Affiliation | Collaboration | Pedro Cisneros-Velarde (VMware Research); Zhijie Chen (University of Illinois Urbana-Champaign); Sanmi Koyejo (Stanford University); Arindam Banerjee (University of Illinois Urbana-Champaign) |
| Pseudocode | No | The paper provides mathematical derivations and theoretical analyses but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions PyTorch 2.0 as a library with built-in implementations of Weight Norm (Section 1) and provides a link to its documentation. However, it does not state that the authors are releasing their own code for the methodology described in the paper. |
| Open Datasets | Yes | We do empirical evaluations on CIFAR-10 (Krizhevsky, 2009) and MNIST (Deng, 2012). |
| Dataset Splits | No | The paper mentions evaluating on CIFAR-10 and MNIST datasets and using mini-batch SGD, but it does not specify the exact training, validation, or test splits used for these datasets. |
| Hardware Specification | Yes | Our experiments were conducted on a computing cluster with AMD EPYC 7713 64-Core Processor and NVIDIA A100 Tensor Core GPU. |
| Software Dependencies | Yes | PyTorch 2.0. PyTorch 2.0 documentation. https://pytorch.org/docs/stable/generated/torch.nn.utils.weight_norm.html. Accessed: 05-09-2023. |
| Experiment Setup | Yes | We apply mini-batch stochastic gradient descent (SGD) with batch size 512 to optimize the Weight Norm networks under mean squared loss. ... with learning rate 0.001, and weights initialized independently from a uniform distribution on [−0.5/√m, 0.5/√m]. ... for two different widths m ∈ {512, 1024} on the MNIST dataset, ... the weights are initialized with a uniform distribution on [−5/√m, 5/√m]. |
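The setup quoted above trains Weight Norm networks, where each weight vector is reparameterized as w = g · v/‖v‖. A minimal NumPy sketch of this reparameterization and the quoted initialization is below; the input dimension `d`, the helper name `weight_norm`, and the choice g = ‖v‖ at initialization are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: the paper uses widths m in {512, 1024};
# the input dimension d = 8 is an assumption for this sketch.
d, m = 8, 512

# Initialization as quoted: entries uniform on [-0.5/sqrt(m), 0.5/sqrt(m)].
v = rng.uniform(-0.5 / np.sqrt(m), 0.5 / np.sqrt(m), size=(m, d))

# Assumed convention: set g = ||v_j|| per row, so that w = v at initialization.
g = np.linalg.norm(v, axis=1)

def weight_norm(v, g):
    """Weight Norm reparameterization: row j of the output is g_j * v_j / ||v_j||."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return g[:, None] * v / norms

w = weight_norm(v, g)

# By construction, each row of w has Euclidean norm exactly g_j,
# regardless of how v is later updated by SGD.
assert np.allclose(np.linalg.norm(w, axis=1), g)
```

During training, SGD updates `v` and `g` separately (rather than `w` directly), which decouples the direction of each weight vector from its scale.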