Does SGD really happen in tiny subspaces?
Authors: Minhak Song, Kwangjun Ahn, Chulhee Yun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we rigorously examine the question (Q1) through systematic experiments. Quite surprisingly, our results reveal that the answer to the question is negative, as summarized below. In Section 3, we demonstrate that the observed alignment is spurious in the sense that the aligned component of the gradient is not beneficial for training, even though it constitutes the majority of the gradient. Specifically, we run a critical experiment where we modify SGD by projecting each update onto the dominant subspace; we call this Dom-SGD. Unexpectedly, Dom-SGD does not further decrease the training loss. |
| Researcher Affiliation | Collaboration | Minhak Song (KAIST Math), Kwangjun Ahn (Microsoft Research), Chulhee Yun (KAIST AI) |
| Pseudocode | No | The paper describes the modified SGD updates (Dom-SGD, Bulk-SGD) in mathematical notation within the main text (e.g., "θ_{t+1} = θ_t − η P_k(θ_t) g_t (Dom-SGD)") but does not present them within a clearly labeled pseudocode block or algorithm figure. |
| Open Source Code | Yes | Furthermore, to facilitate replication and verification, the source code for the experiments is included in the attached supplementary material. This code contains scripts for reproducing the main results discussed in the paper, along with instructions for running the experiments. |
| Open Datasets | Yes | MNIST-5k: We use the first 5000 samples of MNIST dataset (LeCun et al., 1998) for multi-class classification. CIFAR10-5k: We use the first 5000 samples of CIFAR10 dataset (Krizhevsky, 2009) for multi-class classification. SST2-1k: We use the first 1000 samples of SST2 dataset (Socher et al., 2013) for binary classification. |
| Dataset Splits | No | The paper specifies using subsets like "first 5000 samples of MNIST dataset" or "first 1000 samples of SST2 dataset." While these define the data used, it does not explicitly provide information on how these selected samples are further split into training, validation, or test sets for the experiments described in the main sections (3, 4, 5, 6). Appendix H mentions test accuracy for the *full* MNIST dataset, but this specific split information is not provided for the main experimental datasets. |
| Hardware Specification | Yes | All experiments were performed on a single server equipped with 4 NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | Our experiments were conducted using PyTorch (Paszke et al., 2019). While PyTorch is mentioned, a specific version number for the software is not provided. |
| Experiment Setup | Yes | Throughout this paper, all experiments are conducted using a constant learning rate. For experiments using SGD, we use a batch size of 50. Learning rates: MLP on MNIST-5k: 0.01; CNN on CIFAR10-5k: 0.001; Transformer on SST2-1k: 0.001. Specifically, we track the exponential moving average (EMA) of χk(∇L(θt)) values (EMA factor set to 0.9). |
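The Dom-SGD/Bulk-SGD updates quoted above amount to projecting each stochastic gradient onto (or away from) the dominant subspace before the SGD step. Below is a minimal NumPy sketch of that projection step only. It assumes the parameters and gradient are flattened into vectors and that an orthonormal `basis` for the dominant subspace (in the paper, the top-k Hessian eigenvectors) has already been computed; the function names `dom_sgd_step` and `bulk_sgd_step` are illustrative, not from the paper, and the eigenvector computation itself is omitted.

```python
import numpy as np

def project_onto_subspace(grad, basis):
    """Return P_k g: the component of `grad` inside span(basis).

    basis: (k, d) array whose rows are orthonormal vectors spanning
    the dominant subspace (e.g., top-k Hessian eigenvectors).
    grad:  (d,) flattened gradient vector.
    """
    return basis.T @ (basis @ grad)

def dom_sgd_step(theta, grad, basis, lr):
    # Dom-SGD: update using only the dominant-subspace component of the gradient.
    return theta - lr * project_onto_subspace(grad, basis)

def bulk_sgd_step(theta, grad, basis, lr):
    # Bulk-SGD: update using only the complementary (bulk) component.
    return theta - lr * (grad - project_onto_subspace(grad, basis))
```

With `basis` as the first two standard basis vectors of R^3, Dom-SGD moves only in the first two coordinates and Bulk-SGD only in the third, illustrating how the two updates partition each gradient.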
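The setup row also mentions tracking an EMA of χk(∇L(θt)) with factor 0.9. A one-line sketch of that tracking, assuming the common convention where the factor weights the running average (the paper does not spell out which side the factor weights):

```python
def ema_update(ema, value, factor=0.9):
    """One EMA step: new_ema = factor * ema + (1 - factor) * value.

    Assumed convention; `ema=None` seeds the average with the first value.
    Used here to mirror the paper's tracking of chi_k values (factor 0.9).
    """
    return value if ema is None else factor * ema + (1.0 - factor) * value
```

Typical usage: initialize `ema = None` and fold in one measurement per training step, e.g. `ema = ema_update(ema, chi_k)`.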