MMTEB: Massive Multilingual Text Embedding Benchmark

Authors: Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Harries, Loïc Magne, Isabelle Mohr, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Suppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Mariya Hendriksen, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lu, Jordan Clive, Gayatri K, Maksimova Anna, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James V. Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters.
Researcher Affiliation | Collaboration | 1 Aarhus University, 2 Individual Contributor, 3 Esker, 4 INSA Lyon, LIRIS, 5 University of Amsterdam, 6 MBZUAI, 7 Jina AI, 8 Microsoft Research, 9 Wikit, 10 McGill University, 11 University of Oxford, 12 ITMO University, 13 Koç University, 14 Heritage Institute of Technology, 15 Apart Research, 16 BAAI, 17 National Information Processing Institute, 18 New York University, 19 Ellamind, 20 Peking University, 21 CentraleSupélec, 22 Artefact Research Center, 23 Hugging Face, 24 Wrocław University, 25 Korea University, 26 Illuin Technology, 27 Comenius University Bratislava, 28 Cisco Systems, 29 University of Waterloo, 30 Cohere For AI, 31 University of Zurich, 32 Stanford University, 33 FRC CSC RAS, 34 Salesforce, 35 IIT Madras, 36 Sapienza University of Rome, 37 University of Pennsylvania, 38 Salute Devices, 39 Princeton University, 40 University of Washington, 41 Imperial College London, 42 R. V. College of Engineering, 43 Robert Koch Institute, 44 HSE University, 45 Nirma University, 46 Occiglot, 47 Allen Institute for AI, 48 Tano Labs, 49 The London Institute of Banking and Finance, 50 Cornell University, 51 Northeastern University, 52 Hong Kong University, 53 Durham University, 54 ServiceNow Research, 55 Johns Hopkins University, 56 ELLIS Institute Tübingen, 57 MPI-IS Tübingen, 58 Contextual AI
Pseudocode | No | The paper describes methods and processes like 'backward selection method' and 'bootstrapping approach' in narrative text, but does not present them in structured pseudocode or algorithm blocks.
Open Source Code | Yes | MMTEB comes with open-source code available at https://github.com/embeddings-benchmark/mteb and a public leaderboard available at https://huggingface.co/spaces/mteb/leaderboard.
Open Datasets | Yes | MMTEB, the Massive Multilingual Text Embedding Benchmark, which comprises more than 500 distinct tasks across 10 task categories, covering over 250 languages... For an overview see Figure 1. MMTEB comes with open-source code available at https://github.com/embeddings-benchmark/mteb and a public leaderboard available at https://huggingface.co/spaces/mteb/leaderboard.
Dataset Splits | Yes | Classification: First, a train set is constructed by sampling n (8-16) samples for each label. If only a test set is available, a section is split off as a training set. Both sets are then embedded and used to train a logistic regression using a maximum of 100 iterations. Afterwards, performance metrics are calculated. For robustness, this process is repeated 10 times.
Hardware Specification | Yes | It significantly reduces computational cost (3.11 hours on an H100 GPU for a 7B model) by using only 2% of the original documents (6% of the original number of characters) while maintaining sensitivity as a benchmark to rank models accurately.
Software Dependencies | No | The paper mentions using 'codecarbon (Courty et al., 2024)' for emissions tracking and provides revision IDs for the models used, but does not list the specific software libraries or frameworks, with version numbers, needed to replicate the overall methodology or experimental setup.
Experiment Setup | Yes | Classification: First, a train set is constructed by sampling n (8-16) samples for each label. If only a test set is available, a section is split off as a training set. Both sets are then embedded and used to train a logistic regression using a maximum of 100 iterations. Afterwards, performance metrics are calculated.
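The classification protocol quoted in the table (sample n examples per label, fit a logistic regression capped at 100 iterations, average over repeated runs) can be sketched as follows. This is a minimal illustration using scikit-learn, not the benchmark's actual implementation; the function name, defaults, and accuracy metric are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_classification(train_emb, train_labels, test_emb, test_labels,
                            n_per_label=8, n_repeats=10, seed=0):
    """Few-shot classification score for a set of embeddings.

    For each repeat: sample n_per_label training examples per label,
    fit a logistic regression (max 100 iterations), and score it on
    the test set. Returns the mean accuracy over all repeats.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        # Build a balanced few-shot train set by sampling per label.
        idx = []
        for label in np.unique(train_labels):
            pool = np.flatnonzero(train_labels == label)
            take = min(n_per_label, len(pool))
            idx.extend(rng.choice(pool, size=take, replace=False))
        clf = LogisticRegression(max_iter=100)
        clf.fit(train_emb[idx], train_labels[idx])
        scores.append(accuracy_score(test_labels, clf.predict(test_emb)))
    return float(np.mean(scores))
```

Embeddings would come from the model under evaluation; here any (n_samples, dim) array works, so the sketch can be exercised with synthetic data.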