On the Importance of Embedding Norms in Self-Supervised Learning
Authors: Andrew Draganov, Sharvaree Vadgama, Sebastian Damrich, Jan Niklas Böhm, Lucas Maes, Dmitry Kobak, Erik J Bekkers
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. |
| Researcher Affiliation | Academia | 1 Department of Computer Science, Aarhus University, Denmark; 2 IAS-8, Forschungszentrum Jülich, Germany; 3 AMLab, University of Amsterdam, The Netherlands; 4 Hertie Institute for AI in Brain Health, University of Tübingen, Germany; 5 Mila, Quebec AI Institute, Canada. |
| Pseudocode | Yes | Algorithm 1: PyTorch-like pseudo-code using the gradient scaling layer |
| Open Source Code | Yes | Our code is available at https://github.com/Andrew-Draganov/SSLEmbeddingNorms. |
| Open Datasets | Yes | On the left side of the figure, we trained SimCLR and SimSiam models on the CIFAR-10 train set... In contrast, the CIFAR-100 data splits... We then normalize the embedding magnitudes by the maximum across the dataset and bucket the embeddings into ranges of 0.05, giving us 20 embedding buckets over the dataset. Figure 4 (left) then shows the per-bucket accuracy of a kNN classifier which was fit on all the embeddings with respect to the cosine similarity metric. Indeed, we see that the kNN classifier's accuracy shows a clear monotonic trend with the embedding norms across datasets and SSL models. Embedding Norms Encode Human Confidence: Interestingly, not only does the embedding norm provide a measure for the sample's novelty and its classification accuracy, but it also provides a signal for human labelers' confidence and their agreement among one another. Using the CIFAR-10-N and CIFAR-100-N labels from Wei et al. (2021)... Similarly, the CIFAR-10-H dataset from Peterson et al. (2019)... Our experiments are on the CIFAR-10, CIFAR-100, ImageNet-100 and Tiny-ImageNet (Le & Yang, 2015) datasets. |
| Dataset Splits | Yes | On the left side of the figure, we trained SimCLR and SimSiam models on the CIFAR-10 train set for 512 epochs and then compared the embedding norms across different data splits, normalizing all embedding norms by the CIFAR-10 train set mean. The results reveal a clear pattern: embedding norms decrease progressively with increasing distributional distance from the training data. For example, the CIFAR-10 test set contains novel but distributionally similar samples and therefore results in only slightly reduced norms. In contrast, the CIFAR-100 data splits exhibit substantially smaller norms due to their greater distributional shift. This relationship holds symmetrically when training on CIFAR-100 and evaluating on CIFAR-10, as seen on the right side of Figure 3. ... For CIFAR-10, we use the exponential split from Van Assel & Balestriero (2024), where class i has n_i = 5000 * 1.5^(-i) samples. Similarly for CIFAR-100, the i-th class receives n_i = 500 * 1.034^(-i) samples. This way, all classes are represented and both imbalanced datasets contain roughly 15K samples. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or memory) are provided. The paper only mentions 'computational constraints'. |
| Software Dependencies | No | No specific software versions (e.g., PyTorch 1.x, Python 3.x) are provided. The paper refers to 'PyTorch code' and 'PyTorch Foundation' but without version numbers for key libraries. |
| Experiment Setup | Yes | Unless otherwise stated, we use a ResNet-50 backbone (He et al., 2016) and the default settings outlined in the SimCLR (Chen et al., 2020a) and SimSiam (Chen & He, 2021) papers. We use 1e-6 as the default SimCLR weight decay and 5e-4 as the default SimSiam one. ... We use embedding dimensionality 256 in SimCLR and 2048 in SimSiam. ... Due to computational constraints, we run with batch size 256 in SimCLR. Although each batch is still 256 samples in SimSiam, we simulate larger batch sizes using gradient accumulation. Thus, our default batch size for SimSiam is 1024. Our base learning rate is set to 0.18 * BatchSize/256 for SimCLR and 0.12 * BatchSize/256 for SimSiam. We employ a 10-epoch linear warmup followed by cosine scaling. For all cases, we use k = 200 for the kNN classifier and apply it on the normalized embeddings. ... To address this, we evaluate our embedding norm mitigation strategies using the Adam optimizer (Kingma, 2014). ... Namely, we choose base learning rate γ′ = γ/6 with 100 linear warmup epochs followed by cosine annealing, where γ is the default learning rate. |
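The gradient scaling layer named in the paper's Algorithm 1 is, in essence, an identity map in the forward pass whose backward pass multiplies the incoming gradient by a constant. A minimal framework-free sketch of that idea (the function names are ours, not the paper's, and this is not the authors' Algorithm 1):

```python
def grad_scale_forward(x):
    # Forward pass: the layer leaves activations untouched.
    return list(x)

def grad_scale_backward(grad_output, scale):
    # Backward pass: rescale the gradient flowing back through the layer.
    return [scale * g for g in grad_output]

# Toy check on a small vector: the output is identical, the gradient is rescaled.
x = [1.0, -2.0, 3.0]
y = grad_scale_forward(x)
grad_in = grad_scale_backward([1.0, 1.0, 1.0], scale=0.1)
```

In PyTorch the same behavior would typically be packaged as a custom `torch.autograd.Function` whose `forward` returns its input and whose `backward` returns the scaled gradient, which is the shape the paper's PyTorch-like pseudocode suggests.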
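The norm-bucketing evaluation behind Figure 4 can be sketched from the description above. This is our reconstruction, not the authors' code; `correct` stands for per-sample kNN-classifier hits (the paper fits the classifier with k = 200 under cosine similarity), which we take as given:

```python
import numpy as np

def per_bucket_accuracy(embeddings, correct, n_buckets=20):
    """Normalize embedding norms by the dataset maximum, split them into
    buckets of width 1/n_buckets (0.05 for 20 buckets), and report the
    mean classifier accuracy inside each bucket (NaN for empty buckets)."""
    norms = np.linalg.norm(embeddings, axis=1)
    rel = norms / norms.max()                               # in (0, 1]
    bucket = np.minimum((rel * n_buckets).astype(int), n_buckets - 1)
    accs = np.full(n_buckets, np.nan)
    for b in range(n_buckets):
        mask = bucket == b
        if mask.any():
            accs[b] = correct[mask].mean()
    return accs
```

This helper only reproduces the bucketing step; the monotone norm-accuracy trend the paper reports comes from applying it to real SSL embeddings.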
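The exponential class-imbalance splits quoted in the Dataset Splits row follow directly from the two formulas n_i = 5000 * 1.5^(-i) (CIFAR-10) and n_i = 500 * 1.034^(-i) (CIFAR-100). A short sketch (the helper name is ours) that also confirms the "roughly 15K samples" totals:

```python
def exponential_split_sizes(n_classes, n0, ratio):
    # Class i receives n_i = n0 * ratio**(-i) samples (rounded, at least 1),
    # so every class stays represented in the imbalanced split.
    return [max(1, round(n0 * ratio ** (-i))) for i in range(n_classes)]

cifar10_sizes = exponential_split_sizes(10, 5000, 1.5)     # 5000, 3333, 2222, ...
cifar100_sizes = exponential_split_sizes(100, 500, 1.034)  # 500, 484, 468, ...
```

Summing either list gives a dataset of roughly 15K samples, matching the paper's statement.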
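The learning-rate recipe in the setup row (linear scaling with batch size, 10-epoch linear warmup, cosine decay) can be sketched as below; the exact warmup and annealing formula is our assumption, since the paper only names the schedule:

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs, warmup_epochs=10):
    """Linear warmup to base_lr over warmup_epochs, then cosine decay to 0."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

# SimCLR scaling rule from the setup above: 0.18 * batch_size / 256.
base_lr = 0.18 * 256 / 256
```

The Adam variant described in the same row would reuse this shape with γ′ = γ/6 as `base_lr` and `warmup_epochs=100`.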