Qualifying Knowledge and Knowledge Sharing in Multilingual Models

Authors: Nicolas Guerin, Ryan M. Nefdt, Emmanuel Chemla

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we disentangle the multifaceted nature of knowledge: successfully completing a knowledge retrieval task (e.g., "The capital of France is __") involves mastering underlying concepts (e.g., France, Paris), relationships between these concepts (e.g., capital of) and the structure of prompts, including the language of the query. We propose to disentangle these distinct aspects of knowledge and apply this typology to offer a critical view of neuron-level knowledge attribution techniques. For concreteness, we focus on Dai et al.'s (2022) Knowledge Neurons (KNs) across multiple PLMs (BERT, OPT, Llama and Gemma), testing 10 natural languages and additional unnatural languages (e.g. AutoPrompt). Our key contributions are twofold: (i) we show that KNs come in different flavors, some indeed encoding entity-level concepts, some having a much less transparent, more polysemantic role, and (ii) we address the problem of cross-linguistic knowledge sharing at the neuron level; more specifically, we uncover an unprecedented overlap in KNs across up to all of the 10 languages we tested, pointing to the existence of a partially unified, language-agnostic retrieval system. To do so, we introduce and release the MultiParaRel dataset, an extension of ParaRel, featuring prompts and paraphrases for cloze-style knowledge retrieval tasks in parallel over 10 languages.
Researcher Affiliation | Academia | Nicolas Guerin (EMAIL), Laboratoire de Sciences Cognitives et Psycholinguistique, Département d'Études Cognitives, ENS, PSL University, EHESS, CNRS; Ryan Nefdt (EMAIL), University of Cape Town and University of Bristol; Emmanuel Chemla (EMAIL), Laboratoire de Sciences Cognitives et Psycholinguistique, Département d'Études Cognitives, ENS, PSL University, EHESS, CNRS, and Earth Species Project
Pseudocode | No | The paper describes methods and procedures in narrative text and refers to figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper states: 'We release MultiParaRel, a multilingual version of the ParaRel dataset (Elazar et al., 2021a), which includes 10 languages and is compatible with autoregressive models. The dataset is available here.' and 'The dataset is available at https://github.com/GpNico/multi-pararel'. This refers to the release of a dataset, not the source code for the methodology or experiments described in the paper.
Open Datasets | Yes | We introduce and release the MultiParaRel dataset, an extension of ParaRel, featuring prompts and paraphrases for cloze-style knowledge retrieval tasks in parallel over 10 languages. ... We release MultiParaRel, a multilingual version of the ParaRel dataset (Elazar et al., 2021a), which includes 10 languages and is compatible with autoregressive models. The dataset is available at https://github.com/GpNico/multi-pararel. ... For relational facts, we used the TREx dataset (Elsahar et al., 2018) ... We also used mLAMA, which contains triplets for over 53 languages.
Dataset Splits | Yes | Following the same train, development, and test splits as Shin et al. (2020), we trained 10 different seeds of AutoPrompt for each relation and each model.
Hardware Specification | Yes | KNs computations were performed on NVIDIA Tesla V100 GPUs for models with less than a billion parameters, and on NVIDIA Tesla A100 GPUs for larger models. ... For this experiment we studied bert-base-multilingual-uncased (Devlin et al., 2019) and Llama-2-7b. We used an NVIDIA Tesla V100 GPU for BERT and an NVIDIA Tesla A100 GPU for Llama 2, both for about one hour per relation and per language. ... As a translation model, we used Meta's SeamlessM4T and, more specifically, the Hugging Face implementation. We used an NVIDIA Tesla V100 GPU for inference.
Software Dependencies | No | For all these models we use the Hugging Face implementation. ... As a translation model, we used Meta's SeamlessM4T and, more specifically, the Hugging Face implementation. ... While software components like the Hugging Face implementation and Meta's SeamlessM4T are mentioned, specific version numbers for these or other libraries (e.g., Python, PyTorch) are not provided.
Experiment Setup | Yes | First, they retain only neurons with an attribution score greater than t_kn · max_{i,l} Attr_{h,pr,t}(w_i^(l)). This procedure is carried out for each prompt associated with a fact ⟨h, r, t⟩, and thus yields a set of candidate KNs per prompt. Let N_r denote the number of prompts for a given relation r. To get results robust to noise, and to factor out signal associated with specific prompts rather than knowledge, they keep only neurons appearing in the candidate-neuron sets of at least p_kn · N_r prompts. They propose thresholds of t_kn = 0.2 (only keep neurons scoring at least 20% of the max attribution score) and p_kn = 0.7 (only keep neurons appearing in at least 70% of the different prompts for a given relation). ... We define Relation Neurons as KNs that appear in at least t_r · N instances of facts associated with a particular relation, where N is the total number of facts and t_r is a predefined relational threshold. In contrast, neurons that appear in fewer than t_c · N of the facts, for some other threshold t_c, are referred to as Concept Neurons...
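The two-threshold KN selection and the Relation/Concept split quoted under "Experiment Setup" can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the data layout (per-prompt dictionaries of attribution scores, per-fact KN sets), the function names, and the numeric values of t_r and t_c are assumptions; only t_kn = 0.2 and p_kn = 0.7 come from the paper.

```python
from collections import Counter

def select_knowledge_neurons(attribution_scores, t_kn=0.2, p_kn=0.7):
    """Select KNs for one relation, following the quoted procedure.

    attribution_scores: one dict per prompt, mapping a neuron id
    (e.g. a (layer, index) tuple) to its attribution score.
    """
    n_prompts = len(attribution_scores)
    counts = Counter()
    for scores in attribution_scores:
        max_score = max(scores.values())
        # Per prompt: keep neurons scoring at least t_kn * max attribution.
        counts.update(n for n, s in scores.items() if s >= t_kn * max_score)
    # Across prompts: keep neurons in at least p_kn * N_r candidate sets.
    return {n for n, c in counts.items() if c >= p_kn * n_prompts}

def classify_kns(kns_per_fact, t_r, t_c):
    """Split KNs into Relation Neurons (appear in >= t_r * N facts)
    and Concept Neurons (appear in < t_c * N facts), N = number of facts."""
    n_facts = len(kns_per_fact)
    counts = Counter()
    for kns in kns_per_fact:
        counts.update(kns)
    relation = {n for n, c in counts.items() if c >= t_r * n_facts}
    concept = {n for n, c in counts.items() if c < t_c * n_facts}
    return relation, concept
```

A neuron that clears the per-prompt score threshold in only a few prompts is dropped by the second filter, which is what makes the selection robust to prompt-specific noise.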