Training with Mixed-Precision Floating-Point Assignments
Authors: Wonyeol Lee, Rahul Sharma, Alex Aiken
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our technique on image classification tasks by training convolutional networks on CIFAR-10, CIFAR-100, and ImageNet. Our method typically provides > 2× memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading off accuracy. |
| Researcher Affiliation | Collaboration | Wonyeol Lee (Stanford University, USA), Rahul Sharma (Microsoft Research, India), Alex Aiken (Stanford University, USA) |
| Pseudocode | Yes | Algorithm 1: Computing π with precision demotion. Input: (f1, ..., fn), (fn+1, ..., fm), C, r. /* Tensor grouping */ k = 1; Tk = ∅; for i = 1 to m do { Tk = Tk ∪ {vi, dvi}; if i ≤ n then Tk = Tk ∪ {θi, dθi}; if fi is GEMM then { k = k + 1; Tk = ∅ } } end. /* Precision demotion */ (T′1, ..., T′k) = sort(T1, ..., Tk) by decreasing size; π(t) = C(t, hi) for all t ∈ TS; for j = 1 to k do { if ratio_lo(π) ≥ r then break; π(t) = C(t, lo) for all t ∈ T′j } end; return π |
| Open Source Code | No | We have implemented our precision assignment technique using PyTorch (Paszke et al., 2019). Given a model, a loss network, and a dataset, our implementation takes as parameters a precision-candidate assignment C and a lower bound r on the low-precision ratio; it then automatically assigns precisions to tensors (appearing in training) according to our technique and uses those assigned precisions in gradient computations. |
| Open Datasets | Yes | As benchmarks for our experiments, we use the image classification task and three datasets for the task: CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), and ImageNet (Russakovsky et al., 2015). |
| Dataset Splits | Yes | We train all models in a standard way: we apply dynamic loss scaling (a standard technique used in low-precision floating-point training; see §4.2 for details) except for 32-bit training, and use standard settings (e.g., learning rate); see Appendix B for details. |
| Hardware Specification | Yes | All experiments were performed on NVIDIA V100 GPUs; total compute time for all experiments was 1,081 GPU days. |
| Software Dependencies | No | We have implemented our precision assignment technique using PyTorch (Paszke et al., 2019). ... We implement the rounding functions based on the QPyTorch library (Zhang et al., 2019), but a few extensions are required, e.g., to support exponent bias and signal overflows for dynamic loss scaling. |
| Experiment Setup | Yes | Four models on CIFAR-10 and CIFAR-100: We train the four models with a standard setup (kuangliu, 2021). In particular, we run the (non-Nesterov) SGD optimizer for 200 epochs with minibatch size of 128 (over 1 GPU), learning rate of 0.1, momentum of 0.9, weight decay of 5 × 10^-4, and the cosine annealing scheduler for learning rate. For dynamic loss scaling, we use initial scale of 2^16, growth factor of 2, back-off factor of 0.5, and growth interval of 1 epoch, as suggested in PyTorch (PyTorch, 2022a). ShuffleNet-v2 on ImageNet: We train the model with the default setup given in PyTorch's GitHub repository (PyTorch, 2022c), except that we use larger minibatch size and learning rate as in (Goyal et al., 2017; Kalamkar et al., 2019; Krizhevsky, 2014; PyTorch, 2022d) to reduce the wall-clock time of training. In particular, we run the (non-Nesterov) SGD optimizer for 90 epochs with minibatch size of 1024 (over 16 GPUs), learning rate of 0.4, momentum of 0.9, weight decay of 10^-4, and the cosine annealing scheduler for learning rate. For dynamic loss scaling, we use initial scale of 2^16, growth factor of 2, back-off factor of 0.5, and growth interval of 0.5 epoch, as suggested in PyTorch (PyTorch, 2022a). |
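The grouping-and-demotion pseudocode quoted in the Pseudocode row can be sketched in plain Python. This is a minimal sketch under assumed data layouts, not the paper's implementation: `groups` stands for the tensor groups T1..Tk produced at GEMM boundaries, `sizes` maps each tensor to its element count, and `candidate` plays the role of the precision-candidate assignment C.

```python
def assign_precisions(groups, sizes, candidate, r):
    """Sketch of Algorithm 1's demotion phase.

    groups: list of tensor-name lists, one group per GEMM boundary.
    sizes: dict mapping tensor name -> element count.
    candidate: dict mapping tensor name -> (hi_precision, lo_precision).
    r: lower bound on the low-precision ratio.
    Returns pi: dict mapping tensor name -> assigned precision.
    """
    # Start with every tensor at its high-precision candidate: pi(t) = C(t, hi).
    pi = {t: candidate[t][0] for g in groups for t in g}
    total = sum(sizes[t] for g in groups for t in g)

    def low_ratio():
        # Fraction of elements currently assigned their low-precision candidate.
        return sum(sizes[t] for g in groups for t in g
                   if pi[t] == candidate[t][1]) / total

    # Demote whole groups, largest total size first, until the
    # low-precision ratio reaches the lower bound r.
    for g in sorted(groups, key=lambda g: -sum(sizes[t] for t in g)):
        if low_ratio() >= r:
            break
        for t in g:
            pi[t] = candidate[t][1]  # pi(t) = C(t, lo)
    return pi
```

Demoting the largest groups first means the ratio bound is met with as few demoted groups as possible, which keeps more (small) tensors in high precision.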
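The dynamic loss scaling schedule from the Experiment Setup row (initial scale 2^16, growth factor 2, back-off factor 0.5, growth interval of one epoch) can be illustrated with a small standalone class. The class name and update rule below are an illustrative sketch of the standard scheme, not the paper's or PyTorch's actual implementation.

```python
class DynamicLossScaler:
    """Sketch of dynamic loss scaling: halve the scale on overflow,
    double it after a full interval of overflow-free steps."""

    def __init__(self, init_scale=2**16, growth=2.0, backoff=0.5,
                 growth_interval=391):
        # growth_interval=391 is roughly 1 epoch of CIFAR-10 steps
        # (50,000 images / minibatch size 128).
        self.scale = float(init_scale)
        self.growth = growth
        self.backoff = backoff
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, overflowed):
        """Call once per training step with whether gradients overflowed."""
        if overflowed:
            self.scale *= self.backoff  # back off: skip step, shrink scale
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.growth_interval:
                self.scale *= self.growth  # grow after a clean interval
                self.good_steps = 0
```

The QPyTorch extensions mentioned in the Software Dependencies row (signaling overflows) would feed the `overflowed` flag here.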