Latent State Models of Training Dynamics

Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
"To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the L2 norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM; Baum & Petrie, 1966) over the resulting sequences of metrics. ... we train HMMs on training trajectories derived from grokking tasks, language modeling, and image classification across a variety of model architectures and sizes."
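The pipeline quoted above (per-checkpoint weight statistics, stacked into sequences for an HMM) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the toy trajectory are hypothetical, and the actual HMM fitting (e.g. with a library such as hmmlearn) is only noted in a comment.

```python
import math

def checkpoint_metrics(weights):
    """Summarize one checkpoint's flattened weights with the kinds of
    statistics the paper tracks: L2 norm, mean, and variance."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    l2 = math.sqrt(sum(w * w for w in weights))
    return [l2, mean, var]

def trajectory_to_observations(checkpoints):
    """Turn a list of checkpoints (each a flat weight list) into a
    T x 3 observation sequence; a Gaussian-emission HMM would then be
    fit over many such sequences, one per random seed."""
    return [checkpoint_metrics(w) for w in checkpoints]

# Hypothetical 3-checkpoint trajectory of a 4-weight model.
traj = [[1.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0]]
obs = trajectory_to_observations(traj)
# Final checkpoint: L2 norm 2.0, mean 1.0, variance 0.0.
```

Each training run contributes one such observation sequence; the HMM's discrete hidden states then act as latent phases of training shared across seeds.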
Researcher Affiliation | Collaboration
Michael Y. Hu (New York University); Angelica Chen (New York University); Naomi Saphra (New York University); Kyunghyun Cho (New York University; Prescient Design, Genentech; CIFAR LMB)
Pseudocode | No
The paper describes the methodology and algorithms in prose and mathematical formulas, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code | Yes
"Our code is available at https://github.com/michahu/modeling-training."
Open Datasets | Yes
"We collect 40 runs of ResNet18 (He et al., 2016) trained on CIFAR-100 (Krizhevsky, 2009)... The dynamics of MNIST are similar to that of CIFAR-100. We collect 40 training runs of a two-layer MLP learning image classification on MNIST, with hyperparameters based on Simard et al. (2003)."
Dataset Splits | Yes
"We collect trajectories using 40 random seeds and train and validate the HMM on a random 80-20 validation split, a split that we use for all settings. ... Training data size 50000 (splits downloaded from PyTorch) ... Training data size 60000 (splits downloaded from PyTorch)"
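The 80-20 split here is over training runs, not over examples: of the 40 seeds, 32 trajectories train the HMM and 8 validate it. A minimal sketch, with hypothetical function and variable names (the paper does not specify the shuffling procedure):

```python
import random

def split_runs(num_runs=40, val_frac=0.2, seed=0):
    """Shuffle run indices and hold out val_frac of them for
    validation, mirroring the paper's 80-20 split over 40 runs."""
    runs = list(range(num_runs))
    random.Random(seed).shuffle(runs)
    n_val = int(round(num_runs * val_frac))
    return runs[n_val:], runs[:n_val]

train_runs, val_runs = split_runs()
# 32 run indices for training the HMM, 8 for validation.
```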
Hardware Specification | No
The paper does not provide specific hardware details such as GPU/CPU models, processors, or cloud instance types used for running the experiments. It only mentions general training processes.
Software Dependencies | No
The paper mentions software components such as PyTorch and optimizers such as AdamW and SGD, but it does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes
"For all hyperparameter details, see Appendix D." Appendix D (Training Hyperparameters):
    Learning Rate: 1e-1
    Batch Size: 32
    Training data size (randomly generated): 1000
    Architecture: Multilayer perceptron
    Number of hidden layers: 1
    Model Hidden Size: 128
    Weight Decay: 0.01
    Seed: 0 through 40
    Optimizer: SGD
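With SGD and nonzero weight decay, each update also shrinks the weights toward zero. A minimal sketch of one such step using the Appendix D values (lr 1e-1, weight decay 0.01); the decay is folded into the gradient in the common L2 style, which may differ in detail from the authors' exact optimizer configuration:

```python
def sgd_step(params, grads, lr=0.1, weight_decay=0.01):
    """One SGD update with L2 weight decay added to the gradient,
    using the learning rate and weight decay from Appendix D."""
    return [p - lr * (g + weight_decay * p) for p, g in zip(params, grads)]

# With a zero gradient, weight decay alone shrinks the weight:
w = sgd_step([1.0], [0.0])
# w[0] is 1.0 - 0.1 * 0.01 * 1.0, i.e. 0.999 up to float rounding.
```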