Latent State Models of Training Dynamics
Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the L2 norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM; Baum & Petrie, 1966) over the resulting sequences of metrics. ... we train HMMs on training trajectories derived from grokking tasks, language modeling, and image classification across a variety of model architectures and sizes. |
| Researcher Affiliation | Collaboration | Michael Y. Hu (New York University); Angelica Chen (New York University); Naomi Saphra (New York University); Kyunghyun Cho (New York University; Prescient Design, Genentech; CIFAR LMB) |
| Pseudocode | No | The paper describes the methodology and algorithms in prose and mathematical formulas, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures. |
| Open Source Code | Yes | Our code is available at https://github.com/michahu/modeling-training. |
| Open Datasets | Yes | We collect 40 runs of ResNet18 (He et al., 2016) trained on CIFAR-100 (Krizhevsky, 2009)... The dynamics of MNIST are similar to that of CIFAR-100. We collect 40 training runs of a two-layer MLP learning image classification on MNIST, with hyperparameters based on Simard et al. (2003). |
| Dataset Splits | Yes | We collect trajectories using 40 random seeds and train and validate the HMM on a random 80-20 validation split, a split that we use for all settings. ... Training data size 50000 (splits downloaded from PyTorch) ... Training data size 60000 (splits downloaded from PyTorch) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or cloud instance types used for running the experiments. It only mentions general training processes. |
| Software Dependencies | No | The paper mentions software components like "PyTorch" and optimizers like "AdamW" and "SGD," but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all hyperparameter details, see Appendix D. ... Appendix D Training Hyperparameters: learning rate 1e-1; batch size 32; training data size 1000 (randomly generated); architecture: multilayer perceptron; number of hidden layers: 1; model hidden size: 128; weight decay 0.01; seeds 0 through 40; optimizer SGD |
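The pipeline the paper describes — summarizing each checkpoint's weights as a small metric vector (L2 norm, mean, variance), collecting one trajectory per random seed, and splitting trajectories 80-20 before fitting an HMM — can be sketched as below. This is a minimal illustration, not the authors' code: the weight snapshots are simulated with random data, and all array shapes are assumptions chosen to match the 40-seed setup quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_checkpoints, n_weights = 40, 50, 1000  # 40 seeds, as in the paper

def weight_metrics(w):
    """Summarize one flattened weight snapshot as [L2 norm, mean, variance]."""
    return np.array([np.linalg.norm(w), w.mean(), w.var()])

# Simulated weight snapshots standing in for real training checkpoints.
snapshots = rng.normal(size=(n_runs, n_checkpoints, n_weights))

# One metric trajectory per run: shape (n_checkpoints, 3).
trajectories = np.stack([
    np.stack([weight_metrics(w) for w in run]) for run in snapshots
])

# Random 80-20 split over seeds: 32 training runs, 8 validation runs.
perm = rng.permutation(n_runs)
n_train = int(0.8 * n_runs)
train, val = trajectories[perm[:n_train]], trajectories[perm[n_train:]]

print(train.shape, val.shape)  # (32, 50, 3) (8, 50, 3)
```

An HMM could then be fit on the training trajectories, e.g. with hmmlearn's `GaussianHMM.fit(X, lengths)` after concatenating the sequences; the paper does not name the HMM library it uses, so that choice is an assumption.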