ESE: Espresso Sentence Embeddings

Authors: Xianming Li, Zongxi Li, Jing Li, Haoran Xie, Qing Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on STS and RAG suggest that ESE can effectively produce high-quality sentence embeddings with less model depth and embedding size, enhancing embedding inference efficiency. The code is available at https://github.com/SeanLee97/AnglE/blob/main/README_ESE.md.
Researcher Affiliation | Academia | ¹Department of Computing, ³Research Centre on Data Science & Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR; ²School of Data Science, Lingnan University, Hong Kong SAR
Pseudocode | No | The paper describes the methodologies using natural language, mathematical equations, and diagrams (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/SeanLee97/AnglE/blob/main/README_ESE.md.
Open Datasets | Yes | For the datasets, we train sentence embeddings on MultiNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) datasets following previous studies. Evaluation of model performance is conducted on the STS benchmark computed by SentEval (Conneau & Kiela, 2018), where Spearman's correlation in the "all" setting is reported as the evaluation metric. To enable comprehensive evaluation, this benchmark comprises seven widely used STS datasets: STS 2012-2016 (Agirre et al., 2012; 2013; 2014; 2015; 2016), SICK-R (Marelli et al., 2014), and STS-B (Cer et al., 2017). ...on the HotpotQA dataset.
Dataset Splits | Yes | For the datasets, we train sentence embeddings on MultiNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) datasets following previous studies. Evaluation of model performance is conducted on the STS benchmark computed by SentEval (Conneau & Kiela, 2018)...
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models. It mentions fine-tuning LLMs with LoRA but gives no hardware specifications.
Software Dependencies | No | The paper mentions using LoRA and faiss, but does not provide specific version numbers for these software components or for any other libraries/frameworks such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | The initial learning rates are set to 5e-5 and 2e-4 to train BERT-based and LLM-based models, respectively. For efficient LLM fine-tuning, we utilize LoRA (Hu et al., 2021; Dettmers et al., 2024) with parameters lora_r = 32, lora_alpha = 32, and lora_dropout = 0.1. For the ESE setup, the compression dimension k is set to 128 by default, and the weights α and β for joint learning (Eq. 7) are set to 1.
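For reproduction purposes, the hyperparameters reported in the experiment setup can be collected into a single configuration sketch. The variable and key names below are our own conventions for readability, not identifiers from the authors' released code:

```python
# Hypothetical configuration gathering the hyperparameters reported in
# the paper's experiment setup (names are illustrative, not from the
# AnglE/ESE codebase).
BERT_LR = 5e-5   # initial learning rate, BERT-based models
LLM_LR = 2e-4    # initial learning rate, LLM-based models

ese_config = {
    "lora": {                  # LoRA settings for efficient LLM fine-tuning
        "r": 32,
        "alpha": 32,
        "dropout": 0.1,
    },
    "compression_dim_k": 128,  # default ESE compression dimension k
    "alpha": 1.0,              # weight α in the joint objective (Eq. 7)
    "beta": 1.0,               # weight β in the joint objective (Eq. 7)
}
```

A reproduction would still need the unreported details (optimizer, batch size, epochs, hardware), which is consistent with the "No" entries for hardware and software dependencies above.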