Grokking at the Edge of Numerical Stability

Authors: Lucas Prieto, Melih Barsbey, Pedro Mediano, Tolga Birdal

ICLR 2025

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the supporting excerpt from the LLM response.
Research Type: Experimental
"We show our findings on the most commonly studied grokking datasets, outlined in this section. I. Modular arithmetic. The main results in this paper are shown on arithmetic modulo 113... II. Sparse parity. We also validate some of our results on the Sparse Parity task... III. MNIST. Finally, we provide some results on a subset of the classic image classification dataset MNIST..." Figure 2: "As dataset size increases (subplots a to c), MLPs trained on modular addition begin to generalize..."
Researcher Affiliation: Academia
"Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, Tolga Birdal. Department of Computing, Imperial College London"
Pseudocode: No
"Definition 7 (⊥Grad). We propose the following update rule for a given iteration t ∈ ℕ: θ_{t+1} = θ_t − η∇⊥L(θ_t), where the orthogonal component of the gradient, ∇⊥L(θ_t), is obtained by projection onto the hyperplane orthogonal to the current weight vector: ∇⊥L(θ_t) = ∇L(θ_t) − (θ_tᵀ∇L(θ_t) / ||θ_t||²) θ_t."
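Definition 7 can be illustrated with a short numpy sketch of the projection it describes; the function name and the toy vectors are illustrative, not from the paper:

```python
import numpy as np

def perp_grad_step(theta, grad, lr):
    """One ⊥Grad update: remove the component of the gradient parallel
    to the current weight vector, then take a plain gradient step."""
    # Projection of grad onto theta, scaled by 1 / ||theta||^2.
    parallel = (theta @ grad) / (theta @ theta) * theta
    grad_perp = grad - parallel  # orthogonal component of the gradient
    return theta - lr * grad_perp

theta = np.array([1.0, 2.0, 2.0])
grad = np.array([0.5, -1.0, 3.0])
new_theta = perp_grad_step(theta, grad, lr=0.1)
# By construction, the applied update direction is orthogonal to theta.
grad_perp = grad - (theta @ grad) / (theta @ theta) * theta
print(abs(theta @ grad_perp) < 1e-12)  # True
```

The paper's ⊥AdamW and ⊥SGD optimizers build on this projection; the sketch shows only the plain gradient-descent form of the update.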
Open Source Code: Yes
"Code for this paper can be found at: https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability"
Open Datasets: Yes
"We show our findings on the most commonly studied grokking datasets, outlined in this section. I. Modular arithmetic. The main results in this paper are shown on arithmetic modulo 113 (Power et al., 2022; Nanda et al., 2023)... II. Sparse parity. We also validate some of our results on the Sparse Parity task outlined in Barak et al. (2022)... III. MNIST. Finally, we provide some results on a subset of the classic image classification dataset MNIST (Deng, 2012)."
Dataset Splits: Yes
"Our main results use a 40%/60% train/test split but we also include results using 60%/40% and 70%/30%. The input integers are represented as one-hot vectors. II. Sparse parity. ... In this work we use 2000 samples, split evenly between train and test data... III. MNIST. ... we use a subset of 200 training samples from the training set as in Liu et al. (2023b), with evaluation on the full test set."
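A minimal sketch of how the modular-addition data and its 40%/60% split could be constructed; the sizes follow the text, but the uniform random sampling and the seed are assumptions:

```python
import numpy as np

P = 113  # modulus used for the main experiments
rng = np.random.default_rng(0)  # seed is an assumption

# All P*P input pairs (a, b) with label (a + b) mod P.
pairs = np.array([(a, b) for a in range(P) for b in range(P)])
labels = (pairs[:, 0] + pairs[:, 1]) % P

# One-hot encode each operand and concatenate: 2 * 113 = 226 features.
eye = np.eye(P)
X = np.concatenate([eye[pairs[:, 0]], eye[pairs[:, 1]]], axis=1)

# 40% train / 60% test split, as in the main results.
idx = rng.permutation(len(X))
n_train = int(0.4 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(X.shape, n_train)  # (12769, 226) 5107
```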
Hardware Specification: No
No specific hardware details are provided. The paper discusses "Floating Point (FP) arithmetic" and float32, float64, and float16 precision, but not the underlying hardware.
Software Dependencies: No
No specific software versions are provided. The paper mentions AdamW and SGD (as well as its own variants of these optimizers, ⊥AdamW and ⊥SGD), ReLU activations and cross-entropy loss, torch.nn.functional.log_softmax, and CUDA kernels, but without version numbers.
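The numerical-stability issue behind using log_softmax can be illustrated with a small numpy sketch: a naive log-softmax overflows in float32 for large logits, while the shifted log-sum-exp form, which torch.nn.functional.log_softmax also uses internally, stays finite:

```python
import numpy as np

def naive_log_softmax(z):
    # Overflows once exp(z) exceeds the float32 maximum (~3.4e38).
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        return np.log(np.exp(z) / np.exp(z).sum())

def stable_log_softmax(z):
    # Subtract the max before exponentiating (the log-sum-exp trick).
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([100.0, 90.0, 80.0], dtype=np.float32)
print(naive_log_softmax(logits))   # contains nan / -inf in float32
print(stable_log_softmax(logits))  # finite, close to [0, -10, -20]
```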
Experiment Setup: Yes
"We study the grokking phenomenon on these datasets using a 2-hidden-layer multi-layer perceptron (MLP) of width 200 as in Liu et al. (2023a) and a one-layer transformer with 4 attention heads as in Nanda et al. (2023) and Power et al. (2022). We train both of these models in a full-batch setting, using ReLU activations and cross-entropy loss with AdamW and SGD, as well as our own variants of these optimizers, ⊥AdamW and ⊥SGD. Unless specified otherwise we set the weight decay parameter λ = 0. For modular arithmetic datasets, inputs are concatenated as the input of the MLP, resulting in a 226-dimensional vector... We train GPT2-Small for 1 epoch on WikiText-103 using a batch size of 16, a block size of 512, a learning rate of 5e-4 and a weight decay of 0.01 using AdamW. The architecture is the regular GPT2-Small architecture from Radford et al. (2019), trained with a cosine schedule and 1000 steps of warm-up. For CIFAR10, CIFAR100 and ImageNet-1k (Russakovsky et al., 2015), our baseline is a ResNet18 with SCE loss trained with SGD with momentum 0.9 and weight decay 1e-4. We use standard data transformations such as random crop and random horizontal flip, and a step learning rate scheduler every 30 epochs for a full training run of 100 epochs."
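A hedged numpy sketch of the MLP described above (2 hidden layers of width 200, 226-dimensional one-hot input, 113 output classes, ReLU activations, cross-entropy loss); the He-style initialization and the dummy batch are assumptions not stated in the excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, WIDTH, N_CLASSES = 226, 200, 113  # sizes from the text

# He-style initialization is an assumption, not specified in the excerpt.
W1 = rng.normal(0.0, np.sqrt(2.0 / D_IN), (D_IN, WIDTH))
W2 = rng.normal(0.0, np.sqrt(2.0 / WIDTH), (WIDTH, WIDTH))
W3 = rng.normal(0.0, np.sqrt(2.0 / WIDTH), (WIDTH, N_CLASSES))

def forward(X):
    h1 = np.maximum(X @ W1, 0.0)   # first hidden layer, ReLU
    h2 = np.maximum(h1 @ W2, 0.0)  # second hidden layer, ReLU
    return h2 @ W3                 # logits over the 113 classes

def cross_entropy(logits, y):
    # Stable log-softmax followed by mean negative log-likelihood.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

X = np.eye(D_IN)[:8]                    # 8 dummy one-hot inputs
y = rng.integers(0, N_CLASSES, size=8)  # dummy labels
logits = forward(X)
print(logits.shape, cross_entropy(logits, y) > 0.0)  # (8, 113) True
```

In a full-batch run as described in the paper, every training example would be passed through `forward` at once before each optimizer update.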