Grokking at the Edge of Numerical Stability

Authors: Lucas Prieto, Melih Barsbey, Pedro Mediano, Tolga Birdal

ICLR 2025

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the supporting excerpt from the LLM response.
Research Type: Experimental
"We show our findings on the most commonly studied grokking datasets, outlined in this section. I. Modular arithmetic. The main results in this paper are shown on arithmetic modulo 113... II. Sparse parity. We also validate some of our results on the Sparse Parity task... III. MNIST. Finally, we provide some results on a subset of the classic image classification dataset MNIST..." Figure 2: "As dataset size increases (subplots a to c), MLPs trained on modular addition begin to generalize..."
Researcher Affiliation: Academia
"Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, Tolga Birdal. Department of Computing, Imperial College London"
Pseudocode: No
"Definition 7 (⊥Grad). We propose the following update rule for a given iteration t ∈ ℕ: θ_{t+1} = θ_t − η∇⊥L(θ_t), where the orthogonal component of the gradient, ∇⊥L(θ_t), is obtained by projection onto the hyperplane orthogonal to the current weight vector: ∇⊥L(θ_t) = ∇L(θ_t) − (θ_tᵀ∇L(θ_t) / ||θ_t||²) θ_t."
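Definition 7 can be illustrated with a short numpy sketch of the projection it describes; the function name and the toy vectors are illustrative, not from the paper:

```python
import numpy as np

def perp_grad_step(theta, grad, lr):
    """One ⊥Grad update: remove the component of the gradient parallel
    to the current weight vector, then take a plain gradient step."""
    # Projection of grad onto theta, scaled by 1 / ||theta||^2.
    parallel = (theta @ grad) / (theta @ theta) * theta
    grad_perp = grad - parallel  # orthogonal component of the gradient
    return theta - lr * grad_perp

theta = np.array([1.0, 2.0, 2.0])
grad = np.array([0.5, -1.0, 3.0])
new_theta = perp_grad_step(theta, grad, lr=0.1)
# By construction, the applied update direction is orthogonal to theta.
grad_perp = grad - (theta @ grad) / (theta @ theta) * theta
print(abs(theta @ grad_perp) < 1e-12)  # True
```

The paper's ⊥AdamW and ⊥SGD optimizers build on this projection; the sketch shows only the plain gradient-descent form of the update.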
Open Source Code: Yes
"Code for this paper can be found at: https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability"
Open Datasets: Yes
"We show our findings on the most commonly studied grokking datasets, outlined in this section. I. Modular arithmetic. The main results in this paper are shown on arithmetic modulo 113 (Power et al., 2022; Nanda et al., 2023)... II. Sparse parity. We also validate some of our results on the Sparse Parity task outlined in Barak et al. (2022)... III. MNIST. Finally, we provide some results on a subset of the classic image classification dataset MNIST (Deng, 2012)."
Dataset Splits: Yes
"Our main results use a 40%/60% train/test split but we also include results using 60%/40% and 70%/30%. The input integers are represented as one-hot vectors. II. Sparse parity. ... In this work we use 2000 samples, split evenly between train and test data... III. MNIST. ... we use a subset of 200 training samples from the training set as in Liu et al. (2023b), with evaluation on the full test set."
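A minimal sketch of how the modular-addition data and its 40%/60% split could be constructed; the sizes follow the text, but the uniform random sampling and the seed are assumptions:

```python
import numpy as np

P = 113  # modulus used for the main experiments
rng = np.random.default_rng(0)  # seed is an assumption

# All P*P input pairs (a, b) with label (a + b) mod P.
pairs = np.array([(a, b) for a in range(P) for b in range(P)])
labels = (pairs[:, 0] + pairs[:, 1]) % P

# One-hot encode each operand and concatenate: 2 * 113 = 226 features.
eye = np.eye(P)
X = np.concatenate([eye[pairs[:, 0]], eye[pairs[:, 1]]], axis=1)

# 40% train / 60% test split, as in the main results.
idx = rng.permutation(len(X))
n_train = int(0.4 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(X.shape, n_train)  # (12769, 226) 5107
```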
Hardware Specification: No
No specific hardware details are provided. The paper discusses "Floating Point (FP) arithmetic" and float32, float64, and float16 precision, but not the underlying hardware.
Software Dependencies: No
No specific software versions are provided. The paper mentions AdamW and SGD (as well as its own variants of these optimizers, ⊥AdamW and ⊥SGD), ReLU activations and cross-entropy loss, torch.nn.functional.log_softmax, and CUDA kernels, but without version numbers.
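The numerical-stability issue behind using log_softmax can be illustrated with a small numpy sketch: a naive log-softmax overflows in float32 for large logits, while the shifted log-sum-exp form, which torch.nn.functional.log_softmax also uses internally, stays finite:

```python
import numpy as np

def naive_log_softmax(z):
    # Overflows once exp(z) exceeds the float32 maximum (~3.4e38).
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        return np.log(np.exp(z) / np.exp(z).sum())

def stable_log_softmax(z):
    # Subtract the max before exponentiating (the log-sum-exp trick).
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([100.0, 90.0, 80.0], dtype=np.float32)
print(naive_log_softmax(logits))   # contains nan / -inf in float32
print(stable_log_softmax(logits))  # finite, close to [0, -10, -20]
```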
Experiment Setup: Yes
"We study the grokking phenomenon on these datasets using a 2-hidden-layer multi-layer perceptron (MLP) of width 200 as in Liu et al. (2023a) and a one-layer transformer with 4 attention heads as in Nanda et al. (2023) and Power et al. (2022). We train both of these models in a full-batch setting, using ReLU activations and cross-entropy loss with AdamW and SGD, as well as our own variants of these optimizers, ⊥AdamW and ⊥SGD. Unless specified otherwise we set the weight decay parameter λ = 0. For modular arithmetic datasets, inputs are concatenated as the input of the MLP, resulting in a 226-dimensional vector... We train GPT2-Small for 1 epoch on WikiText-103 using a batch size of 16, a block size of 512, a learning rate of 5e-4 and a weight decay of 0.01 using AdamW. The architecture is the regular GPT2-Small architecture from Radford et al. (2019), trained with a cosine schedule and 1000 steps of warm-up. For CIFAR10, CIFAR100 and ImageNet-1k (Russakovsky et al., 2015), our baseline is a ResNet18 with SCE loss trained with SGD with momentum 0.9 and weight decay 1e-4. We use standard data transformations such as random crop and random horizontal flip, and a step learning rate scheduler every 30 epochs for a full training run of 100 epochs."
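A hedged numpy sketch of the MLP described above (2 hidden layers of width 200, 226-dimensional one-hot input, 113 output classes, ReLU activations, cross-entropy loss); the He-style initialization and the dummy batch are assumptions not stated in the excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, WIDTH, N_CLASSES = 226, 200, 113  # sizes from the text

# He-style initialization is an assumption, not specified in the excerpt.
W1 = rng.normal(0.0, np.sqrt(2.0 / D_IN), (D_IN, WIDTH))
W2 = rng.normal(0.0, np.sqrt(2.0 / WIDTH), (WIDTH, WIDTH))
W3 = rng.normal(0.0, np.sqrt(2.0 / WIDTH), (WIDTH, N_CLASSES))

def forward(X):
    h1 = np.maximum(X @ W1, 0.0)   # first hidden layer, ReLU
    h2 = np.maximum(h1 @ W2, 0.0)  # second hidden layer, ReLU
    return h2 @ W3                 # logits over the 113 classes

def cross_entropy(logits, y):
    # Stable log-softmax followed by mean negative log-likelihood.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

X = np.eye(D_IN)[:8]                    # 8 dummy one-hot inputs
y = rng.integers(0, N_CLASSES, size=8)  # dummy labels
logits = forward(X)
print(logits.shape, cross_entropy(logits, y) > 0.0)  # (8, 113) True
```

In a full-batch run as described in the paper, every training example would be passed through `forward` at once before each optimizer update.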