Making Text Embedders Few-Shot Learners

Authors: Chaofan Li, Minghao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Defu Lian, Yingxia Shao, Zheng Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we examine the effectiveness of the ICL Embedder training strategy and rethink the training methodologies for LLM-based embedding models. We focus on the following questions: RQ1: How does our ICL Embedder perform in zero-shot and few-shot scenarios? RQ2: How does the performance of our ICL Embedder compare to other LLM-based embedding methods? RQ3: How does our ICL training strategy affect the performance of embedding models compared to the normal ICL training strategy? RQ4: Will changes in model architecture, such as bidirectional attention and mean pooling, improve the performance of the ICL Embedder?"
Researcher Affiliation | Academia | Chaofan Li (1,2), Minghao Qin (2,3), Shitao Xiao (2), Jianlyu Chen (2,4), Kun Luo (2,3), Defu Lian (4), Yingxia Shao (1), Zheng Liu (2). Affiliations: 1 Beijing University of Posts and Telecommunications; 2 Beijing Academy of Artificial Intelligence; 3 Chinese Academy of Sciences; 4 University of Science and Technology of China.
Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "We have publicly released our method at this repo."
Open Datasets | Yes | "To ensure a fair comparison, we use the E5-Mistral dataset, which is employed to fine-tune both the E5-Mistral (Wang et al., 2023b) and LLM2Vec (BehnamGhader et al., 2024). This dataset includes some in-domain retrieval datasets from MTEB, including HotpotQA (Yang et al., 2018), FEVER (Thorne et al., 2018), MSMARCO passage ranking (Nguyen et al., 2016), NQ (Karpukhin et al., 2020) and Quora Duplicate Questions (Data Canary et al., 2017), as well as other publicly available retrieval datasets, including ELI5 (Fan et al., 2019), MIRACL (Zhang et al., 2023), MSMARCO document ranking (Nguyen et al., 2016), NLI (Gao et al., 2021), SQuAD (Karpukhin et al., 2020), TriviaQA (Karpukhin et al., 2020), Mr. TyDi (Zhang et al., 2021), DuReader (Qiu et al., 2022), and T2Ranking."
Dataset Splits | No | The paper states, "For tasks with training sets: We reserve a small subset of the training set for testing purposes." However, it does not provide specific train/validation/test split percentages, exact sample counts, or citations to predefined splits, so the splitting procedure cannot be reproduced exactly.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using Low-Rank Adaptation (LoRA) and the Mistral-7B model, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | "We fine-tune the Mistral-7B model using the contrastive loss and train it for a single epoch. For efficient fine-tuning, we employ Low-Rank Adaptation (LoRA) (Hu et al., 2021), setting the LoRA rank to 64 and the LoRA alpha to 32, with a learning rate of 1e-4. For retrieval tasks, we use in-batch negatives. Each dataset incorporates 7 hard negatives. The batch size is set to 512 for retrieval tasks and 256 for other types of tasks. [...] In training, the maximum length for the query, passage, and example is set to 512. The example comprises the example query and example passage, each with a maximum length of 256. The maximum length for the concatenated query and examples is 2048."
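The quoted setup (contrastive loss with in-batch negatives, plus 7 hard negatives per query) corresponds to a standard InfoNCE-style objective. A minimal NumPy sketch of that loss is below; the temperature value is an illustrative assumption, not a figure reported in the excerpt.

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.02):
    """InfoNCE contrastive loss with in-batch negatives.

    q: (B, d) query embeddings; p: (N, d) passage embeddings with
    p[i] the positive for q[i] (N >= B). Every other row of p acts
    as a negative; the paper's 7 hard negatives per query would
    simply be appended as extra rows of p.
    The temperature value is illustrative, not taken from the paper.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    scores = q @ p.T / temperature                    # (B, N) similarities
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return -np.mean(log_probs[idx, idx])              # positives on the diagonal
```

With perfectly aligned query/passage pairs the loss approaches zero; mismatched pairs drive it up, which is what pushes the embedder to rank each query's positive above the in-batch and hard negatives.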