Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Grokking phase transitions in learning local rules with gradient descent

Authors: Bojan Žunkovič, Enej Ilievski

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze the rule-30 cellular automaton learning task, numerically determine the critical exponent and the grokking time distribution, and compare them with the prediction of the proposed grokking model. Finally, we numerically study the connection between structure formation and grokking. (Sec. 4.3, Simulation details and results)
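For context, rule 30 is the standard elementary cellular automaton rule in which each cell becomes its left neighbour XOR the OR of the centre and right cells. A minimal illustration of the update (not code from the paper; periodic boundaries are an assumption here):

```python
def rule30_step(row):
    # Rule 30: new cell = left XOR (centre OR right), with periodic boundaries.
    n = len(row)
    return [row[(i - 1) % n] ^ (row[i] | row[(i + 1) % n]) for i in range(n)]

def rule30(initial, steps):
    # Evolve an initial binary row for the given number of synchronous steps.
    rows = [list(initial)]
    for _ in range(steps):
        rows.append(rule30_step(rows[-1]))
    return rows
```

A single seed cell expands into the familiar chaotic triangle: `rule30([0, 0, 0, 1, 0, 0, 0], 1)[1]` gives `[0, 0, 1, 1, 1, 0, 0]`.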
Researcher Affiliation | Academia | Bojan Žunkovič, University of Ljubljana, Faculty of Computer and Information Science, 1000 Ljubljana, Slovenia; Enej Ilievski, University of Ljubljana, Faculty of Mathematics and Physics, 1000 Ljubljana, Slovenia
Pseudocode | No | The paper describes methods and models using mathematical equations and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide any links to code repositories. The licensing information provided covers the paper itself, not its code.
Open Datasets | No | The paper describes how the data for the D-dimensional ball model is generated: 'First, we sample a D-dimensional normal distribution with zero mean and unit variance. Then, we shift the positive samples in the direction of the first coordinate (parallel to ε) and the second coordinate (perpendicular to ε) for the same amount. We shift the negative samples in the opposite direction to the positive samples. After the shift, we normalize the samples and multiply them with a square root of a random number in the interval [r0, 1].' It also refers to the 'rule-30 cellular automaton learning task', but no concrete access information (link, DOI, repository) is provided for the specific datasets used in the experiments. The datasets appear to be procedurally generated or based on theoretical models rather than drawn from external public datasets.
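The generation procedure quoted above can be sketched directly. The sample count `n`, the shift magnitude, and the function name below are illustrative placeholders; the quoted passage does not fix their values:

```python
import numpy as np

def sample_ball_data(n, dim, shift, r0, rng=None):
    """Sketch of the D-dimensional ball data generation quoted above.
    `n` and `shift` are placeholders not fixed by the paper."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample from a D-dimensional normal with zero mean and unit variance.
    x = rng.standard_normal((2 * n, dim))
    labels = np.array([1] * n + [-1] * n)
    # Shift positives along the first two coordinates, negatives oppositely.
    x[:, :2] += shift * labels[:, None]
    # Normalize to the unit sphere, then rescale by sqrt of a uniform
    # random number in [r0, 1] to fill the ball.
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    radii = np.sqrt(rng.uniform(r0, 1.0, size=2 * n))
    x *= radii[:, None]
    return x, labels
```

By construction every sample ends up with norm between sqrt(r0) and 1, so the two classes occupy overlapping, oppositely shifted regions of the ball.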
Dataset Splits | No | The paper states, 'We sample N positive and N negative samples and then train the model with gradient descent' for the training dataset. For testing, it mentions, 'The test error is calculated as a statistical average of incorrectly classified samples over the probability distributions P.' and 'The test set will include all possible input sizes from Mtest = 3, 4, …'. While these describe the nature of the training and testing data, the paper does not provide specific percentages, sample counts, or an explicit splitting methodology for training, validation, and test sets in the conventional sense required for reproducibility.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory, or specific computing platforms) used for running the experiments or simulations.
Software Dependencies | No | The paper mentions using the 'Adam optimiser (with standard parameter setting)' but does not specify its version or any other software libraries or frameworks with version numbers that would be needed to reproduce the experimental setup.
Experiment Setup | Yes | We initialise the model with a random initial condition, where all the entries of the tensors A, B are uncorrelated and sampled according to a normal distribution with zero mean and unit variance. We train the model with the Adam optimiser (with standard parameter setting) and a learning rate of 0.005. We use the same loss as in the previous sections, Eq. 4, with L1,2 regularisation strength λ1,2 ∈ [0, 0.001]. The regularisation strength is the same for the attention tensor A and the classifier tensor B. We add a sigmoid non-linearity before the final sign non-linearity to improve the training stability and reduce the training time.
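As a rough illustration of the quoted setup, the sketch below combines a standard Adam step (default betas, lr = 0.005) with L1 and L2 penalty gradients for a tensor. The helper names, the placeholder tensor shapes, and the way the regularisation enters the gradient are assumptions, since Eq. 4 is not reproduced here:

```python
import numpy as np

def init_state(shape):
    # Adam moment buffers and step counter.
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

def adam_step(param, grad, state, lr=0.005, beta1=0.9, beta2=0.999,
              eps=1e-8, lam1=0.001, lam2=0.001):
    # Standard Adam update with lr = 0.005 as quoted; lam1/lam2 are taken
    # from the stated range [0, 0.001]. Adding lam1*sign(param) +
    # 2*lam2*param to the gradient is an assumption about how the L1,2
    # regularisation enters the loss.
    grad = grad + lam1 * np.sign(param) + 2.0 * lam2 * param
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Tensors A and B start i.i.d. standard normal, as in the quoted setup
# (the 4x4 shapes are placeholders; the paper does not give them here).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
state_A, state_B = init_state(A.shape), init_state(B.shape)
```

The same regularisation strength is applied to both `A` and `B`, matching the quoted statement that λ is shared between the attention and classifier tensors.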