Effective post-training embedding compression via temperature control in contrastive training

Authors: Georgiana Dinu, Corey Barrett, Yi Xiang, Miguel Romero Calvo, Anna Currey, Xing Niu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We start off by investigating the impact of the temperature on different text embedding tasks, where we specifically observe a trade-off between performance on retrieval and on clustering tasks as a function of τ. [...] We evaluate using the standard English MTEB benchmark (Muennighoff et al., 2023)... Results for retrieval and clustering are shown in Figure 2..."
Researcher Affiliation | Industry | "Georgiana Dinu, Corey Barrett, Yi Xiang, Miguel Romero Calvo, Anna Currey, Xing Niu. Amazon, USA (EMAIL); Oracle, USA (EMAIL)"
Pseudocode | No | The paper describes methods and formulas (e.g., L_InfoNCE in Section 2, L_MRL in Section 4, L_TempAgg in Section 5) but does not present them as structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using an existing architecture: "We use the CodeSage architecture introduced in Zhang et al. (2024). Available at https://huggingface.co/codesage/codesage-base." However, there is no explicit statement or link providing the source code for the specific methodology developed in this paper (temperature control in contrastive training for embedding compression).
Open Datasets | Yes | "For the contrastive stage, we train on MS MARCO (Bajaj et al., 2018; Wang et al., 2023), NQ (Karpukhin et al., 2020; Gao & Callan, 2021), NLI (Gao et al., 2022), HotpotQA (Yang et al., 2018), FEVER (Thorne et al., 2018), MIRACL (Zhang et al., 2023), and Mr. TyDi (Zhang et al., 2021), totaling approximately 2 million data points (see details in Appendix A). We use the training splits of these datasets released by Thakur et al. (2021)."
Dataset Splits | Yes | "We use the training splits of these datasets released by Thakur et al. (2021). [...] We evaluate using the standard English MTEB benchmark (Muennighoff et al., 2023), which contains a total of 56 datasets categorized into eight tasks..."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only lists general training parameters in Appendix A (Figure 7).
Software Dependencies | No | The paper mentions "tokenize the text with tiktoken" and "optimizer Fused Adam", but does not specify version numbers for any software dependencies or libraries, which would be required for a reproducible setup.
Experiment Setup | Yes | "Figure 7: Additional training parameters. We use in-batch negatives with a batch size of 256 and homogeneous sampling, meaning that the negative samples are drawn from the same training set. All models are tested after 2000 training steps. --max_seq_length 1024 --max_steps 3000 --warmup_steps 58 --base_global_batch_size 4096 --weight_decay 0.1 --base_learning_rate 5e-06 --lr_min_ratio 1e-01 --base_max_steps 3000 --lr_scheduler_type cosine --gradient_clip_val 1.0 --optimizer Fused Adam"
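The temperature τ assessed above enters the InfoNCE objective as a divisor on the similarity scores before the softmax. A minimal NumPy sketch of temperature-scaled InfoNCE with in-batch negatives, as used in the contrastive stage the review quotes (an illustrative reconstruction under standard definitions, not the authors' code; function and variable names are ours):

```python
import numpy as np

def info_nce(queries, docs, tau=0.05):
    """Temperature-scaled InfoNCE with in-batch negatives.

    queries, docs: (B, d) arrays; docs[i] is the positive for
    queries[i], and the remaining rows of docs act as negatives.
    tau: temperature; lower values sharpen the softmax.
    """
    # L2-normalize so dot products are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = (q @ d.T) / tau  # (B, B); the diagonal holds the positives

    # Row-wise log-softmax, computed stably via log-sum-exp.
    m = sims.max(axis=1, keepdims=True)
    log_probs = sims - (m + np.log(np.exp(sims - m).sum(axis=1, keepdims=True)))

    # Loss is the mean negative log-probability of the positive pairs.
    return float(-np.diag(log_probs).mean())
```

Lowering tau concentrates the softmax on the hardest in-batch negatives, which is consistent with the retrieval-versus-clustering trade-off as a function of τ that the Research Type row quotes.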