INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Authors: Angelika Romanou, Negar Foroutan, Anna Sotnikova, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Zeming Chen, Mohamed Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya K Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan S Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Moakhar, Ayush Tarun, Azmine Toushik Wasi, Thenuka Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we sample INCLUDE into two subsets for different evaluation budgets and assess an array of closed and open models on these partitions. Our results demonstrate that current models achieve high variance in performance between different languages in INCLUDE, and that models often struggle with questions requiring regional knowledge. |
| Researcher Affiliation | Collaboration | 1 EPFL, 2 Cohere For AI, 3 Cohere For AI Community, 4 ETH Zurich, 5 Saarland University, 6 University of Toronto, 7 ServiceNow, 8 Georgia Institute of Technology |
| Pseudocode | No | The paper describes a data collection procedure in text form and an experimental setup, but it does not include any explicitly labeled pseudocode or algorithm blocks. It outlines processes like data extraction and quality control in prose rather than structured steps. |
| Open Source Code | Yes | We release two subsets, INCLUDE-BASE and INCLUDE-LITE, alongside the associated documentation and code for data processing and evaluation. These resources will be made publicly available upon acceptance. |
| Open Datasets | Yes | Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed. https://huggingface.co/datasets/CohereForAI/include-base-44 |
| Dataset Splits | No | The paper describes the creation of two subsets, INCLUDE-BASE and INCLUDE-LITE, with specific sampling strategies and sample limits per language. However, it does not provide traditional training/validation/test splits, as the dataset is designed as an evaluation benchmark: models are evaluated with 5-shot or zero-shot prompting on these subsets, which serve as evaluation sets rather than as splits for model training or validation. |
| Hardware Specification | Yes | Each model was evaluated using a single A100 GPU (80GB memory), with evaluation times averaging approximately 4 hours for INCLUDE-BASE. |
| Software Dependencies | No | The paper mentions models used and some evaluation parameters (e.g., decoding temperature, generation lengths) but does not specify software versions for any key libraries, frameworks, or programming languages (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Following Hendrycks et al. (2020), we report both 5-shot and zero-shot scores. For the zero-shot setting, we employ a Chain-of-Thought (CoT; Wei et al., 2022) approach by appending the translation of "let's think step by step" to the prompt (Kojima et al., 2022). The maximum generation lengths for the 5-shot and zero-shot CoT configurations are set to 512 and 1024 tokens. |
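The experiment-setup row above can be sketched as prompt construction in Python. This is a minimal illustration, not the authors' released evaluation code: the helper names, prompt template, and exemplar format are assumptions; only the two configurations (5-shot, zero-shot CoT with an appended "let's think step by step" phrase) and the 512/1024 max-generation-length limits come from the quoted setup.

```python
# Hypothetical sketch of the two INCLUDE evaluation configurations.
# Assumed: question/choice formatting and helper names are illustrative.

def build_five_shot_prompt(question, choices, exemplars):
    """5-shot prompt: prepend up to five worked exemplars, then the target MCQ."""
    parts = []
    for ex in exemplars[:5]:
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    parts.append(f"Question: {question}\n" + "\n".join(choices) + "\nAnswer:")
    return "\n\n".join(parts)

def build_zero_shot_cot_prompt(question, choices,
                               step_phrase="Let's think step by step."):
    """Zero-shot CoT: append the (translated) step-by-step phrase to the prompt."""
    return f"Question: {question}\n" + "\n".join(choices) + f"\n{step_phrase}"

# Max generation lengths reported in the paper for each configuration.
MAX_NEW_TOKENS = {"5-shot": 512, "zero-shot-cot": 1024}
```

In a full harness, the zero-shot CoT prompt would be sent with `MAX_NEW_TOKENS["zero-shot-cot"]` as the generation limit and the final answer parsed from the model's reasoning; the 5-shot prompt expects a direct answer continuation, hence the smaller 512-token budget.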