On the Similarities of Embeddings in Contrastive Learning

Authors: Chungpa Lee, Sehee Lim, Kibok Lee, Jy-Yong Sohn

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that incorporating the proposed loss improves performance in small-batch settings. In this section, we empirically validate the impact of our theoretical results discussed in Sec. 5, especially in the practical scenario of mini-batch training. First, we empirically observe that the excessive separation of negative pairs (proven in Theorem 5.5) actually occurs in experiments on benchmark datasets. Second, we empirically confirm that this excessive separation can be mitigated by the proposed loss in Def. 5.7, which reduces the variance of the negative-pair similarities. Third, we observe that this variance reduction improves the quality of learned representations in various real-world experiments.
Researcher Affiliation | Academia | Department of Statistics and Data Science, Yonsei University, Seoul, Korea. Correspondence to: Jy-yong Sohn <EMAIL>
Pseudocode | No | The paper defines contrastive loss formulations (Definitions 3.1 and 3.2) and a variance-reduction auxiliary loss (Definition 5.7), but it does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation is based on the open-source library solo-learn (da Costa et al., 2022) for self-supervised learning. The source code is available at https://github.com/leechungpa/embedding-similarity-cl/.
Open Datasets | Yes | The models are pretrained on CIFAR-100 (Krizhevsky et al., 2009)... We pretrain models on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009) using various contrastive losses...
Dataset Splits | Yes | We pretrain models on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009) using various contrastive losses... For the linear evaluation protocol, we remove the projector head and use the pretrained encoder for downstream classification tasks... we report top-1 accuracy on the downstream dataset.
Hardware Specification | Yes | All experiments were conducted using a single NVIDIA RTX 4090 GPU.
Software Dependencies | No | Our implementation is based on the open-source library solo-learn (da Costa et al., 2022) for self-supervised learning. The paper mentions this library but does not provide specific version numbers for it or for other key software dependencies (e.g., Python, PyTorch).
Experiment Setup | Yes | For all experiments, we use a modified ResNet-18... and ResNet-50 for ImageNet-100... we attach a 2-layer MLP as the projection head... For the CIFAR datasets, the crop size is set to 32, while for ImageNet-100 we use a crop size of 224... we use stochastic gradient descent (SGD) for 200 epochs. The learning rate is scaled linearly with the batch size as lr = base_lr × BatchSize/256, where the base learning rate is set to 0.3 for the CIFAR datasets and 0.1 for ImageNet-100. A cosine decay schedule is applied, with a weight decay of 0.0001 and SGD momentum set to 0.9. Additionally, we use linear warmup for the first 10 epochs. We tune the temperature parameter for baseline methods... by performing a grid search over the range 0.1 to 0.5 in increments of 0.1... For tuning the proposed loss L_VRNS(U, V) in Def. 5.7, we conducted a grid search for λ over the set {0.1, 0.3, 1, 3, 10, 30, 100}.
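The excerpts above describe the proposed loss of Def. 5.7 only at a high level: an auxiliary term that reduces the variance of negative-pair similarities, combined with a standard contrastive loss via a weight λ. The exact formulation is not reproduced in this report, so the following NumPy sketch is only an illustration of that idea, not the paper's definition; all function names are hypothetical.

```python
import numpy as np

def info_nce(u, v, temperature=0.3):
    """Standard InfoNCE over a mini-batch of paired embeddings (rows of u, v)."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = u @ v.T / temperature                  # (B, B) cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives sit on the diagonal

def negative_similarity_variance(u, v):
    """Variance of cosine similarities over negative (off-diagonal) pairs."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = u @ v.T
    off_diag = sim[~np.eye(len(u), dtype=bool)]     # drop the positive pairs
    return off_diag.var()

def total_loss(u, v, lam=1.0, temperature=0.3):
    """InfoNCE plus a lambda-weighted variance penalty on negative-pair
    similarities, mirroring the role (not the exact form) of Def. 5.7."""
    return info_nce(u, v, temperature) + lam * negative_similarity_variance(u, v)
```

Per the setup above, λ would be tuned by grid search over {0.1, 0.3, 1, 3, 10, 30, 100}; the paper's actual implementation is in PyTorch via solo-learn.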
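The learning-rate recipe in the setup (linear scaling by batch size, 10-epoch linear warmup, then cosine decay over 200 epochs) can be sketched as follows. The helper name and per-epoch granularity are assumptions; solo-learn may step the schedule per iteration rather than per epoch.

```python
import math

def lr_at_epoch(epoch, batch_size, base_lr=0.3,
                warmup_epochs=10, total_epochs=200):
    """Learning rate at a given epoch: linear batch-size scaling,
    linear warmup, then cosine decay (hypothetical helper)."""
    scaled_lr = base_lr * batch_size / 256          # lr = base_lr * BatchSize / 256
    if epoch < warmup_epochs:
        # Linear warmup over the first `warmup_epochs` epochs.
        return scaled_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return scaled_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For ImageNet-100 one would pass `base_lr=0.1`, matching the setup quoted above.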