Effective post-training embedding compression via temperature control in contrastive training

Authors: Georgiana Dinu, Corey Barrett, Yi Xiang, Miguel Romero Calvo, Anna Currey, Xing Niu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We start off by investigating the impact of the temperature on different text embedding tasks, where we specifically observe a trade-off between performance on retrieval and on clustering tasks as a function of τ. [...] We evaluate using the standard English MTEB benchmark (Muennighoff et al., 2023)... Results for retrieval and clustering are shown in Figure 2..."
Researcher Affiliation | Industry | "Georgiana Dinu, Corey Barrett, Yi Xiang, Miguel Romero Calvo, Anna Currey, Xing Niu. Amazon, USA (EMAIL); Oracle, USA (EMAIL)"
Pseudocode | No | The paper describes methods and formulas (e.g., L_InfoNCE in Section 2, L_MRL in Section 4, L_TempAgg in Section 5) but does not present them as structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using an existing architecture: "We use the CodeSage architecture introduced in Zhang et al. (2024). Available at https://huggingface.co/codesage/codesage-base." However, there is no explicit statement or link providing the source code for the specific methodology developed in this paper (temperature control in contrastive training for embedding compression).
Open Datasets | Yes | "For the contrastive stage, we train on MS MARCO (Bajaj et al., 2018; Wang et al., 2023), NQ (Karpukhin et al., 2020; Gao & Callan, 2021), NLI (Gao et al., 2022), HotpotQA (Yang et al., 2018), FEVER (Thorne et al., 2018), MIRACL (Zhang et al., 2023), and Mr. TyDi (Zhang et al., 2021), totaling approximately 2 million data points (see details in Appendix A). We use the training splits of these datasets released by Thakur et al. (2021)."
Dataset Splits | Yes | "We use the training splits of these datasets released by Thakur et al. (2021). [...] We evaluate using the standard English MTEB benchmark (Muennighoff et al., 2023), which contains a total of 56 datasets categorized into eight tasks..."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only lists general training parameters in Appendix A (Figure 7).
Software Dependencies | No | The paper mentions "tokenize the text with tiktoken" and "optimizer Fused Adam", but does not specify version numbers for any software dependencies or libraries, which would be required for a reproducible setup.
Experiment Setup | Yes | "Figure 7: Additional training parameters. We use in-batch negatives with a batch size of 256 and homogeneous sampling, meaning that the negative samples are drawn from the same training set. All models are tested after 2000 training steps. --max_seq_length 1024 --max_steps 3000 --warmup_steps 58 --base_global_batch_size 4096 --weight_decay 0.1 --base_learning_rate 5e-06 --lr_min_ratio 1e-01 --base_max_steps 3000 --lr_scheduler_type cosine --gradient_clip_val 1.0 --optimizer Fused Adam"
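The temperature τ assessed above enters the InfoNCE objective as a divisor on the similarity scores before the softmax. A minimal NumPy sketch of temperature-scaled InfoNCE with in-batch negatives, as used in the contrastive stage the review quotes (an illustrative reconstruction under standard definitions, not the authors' code; function and variable names are ours):

```python
import numpy as np

def info_nce(queries, docs, tau=0.05):
    """Temperature-scaled InfoNCE with in-batch negatives.

    queries, docs: (B, d) arrays; docs[i] is the positive for
    queries[i], and the remaining rows of docs act as negatives.
    tau: temperature; lower values sharpen the softmax.
    """
    # L2-normalize so dot products are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = (q @ d.T) / tau  # (B, B); the diagonal holds the positives

    # Row-wise log-softmax, computed stably via log-sum-exp.
    m = sims.max(axis=1, keepdims=True)
    log_probs = sims - (m + np.log(np.exp(sims - m).sum(axis=1, keepdims=True)))

    # Loss is the mean negative log-probability of the positive pairs.
    return float(-np.diag(log_probs).mean())
```

Lowering tau concentrates the softmax on the hardest in-batch negatives, which is consistent with the retrieval-versus-clustering trade-off as a function of τ that the Research Type row quotes.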