Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts

Authors: Lihu Chen, Adam Dejl, Francesca Toni

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations demonstrate that our method outperforms baseline methods significantly. More importantly, analysis of neuron distributions reveals the presence of visible localized regions, particularly within different subject domains. Finally, we show potential applications of our detected neurons in knowledge editing and neuron-based prediction.
Researcher Affiliation | Academia | Imperial College London, UK
Pseudocode | No | The paper describes a framework and its components using mathematical formulations (e.g., equations 1-7) and step-by-step descriptions in text, but it does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/tigerchen52/qrneuron
Open Datasets | Yes | Domain Dataset is derived from MMLU (Hendrycks et al. 2020), a multi-choice QA benchmark designed to evaluate models across a wide array of subjects with varying difficulty levels. [...] Language Dataset is adapted from Multilingual LAMA (Kassner, Dufter, and Schütze 2021), which is a dataset to investigate knowledge in language models in a multilingual setting covering 53 languages.
Dataset Splits | Yes | Domain Dataset is derived from MMLU (Hendrycks et al. 2020), a multi-choice QA benchmark designed to evaluate models across a wide array of subjects with varying difficulty levels. The subjects encompass traditional disciplines such as mathematics and history, as well as specialized fields like law and ethics. In our study, we select six high school exam subjects from the test set: Biology, Physics, Chemistry, Mathematics, Computer Science, and Geography. [...] Hyper-parameters are selected based on a hold-out set of biology queries with 50 samples.
Hardware Specification | Yes | We ran all experiments on three NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using Llama-2-7B and Mistral-7B models, but does not provide specific software dependency versions (e.g., PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | As for the hyper-parameters, the number of estimation steps was set to m = 16 and the attribution threshold t to 0.2 times the maximum attribution score. The template number was |Q| = 3, the frequency u for obtaining common neurons was 30%, and the top-v for selecting coarse neurons was 20. We ran all experiments on three NVIDIA V100 GPUs. It took 120 seconds on average to locate neurons for a query with three prompt templates.
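The quoted hyper-parameters imply a simple two-stage selection procedure: keep neurons whose attribution exceeds t = 0.2 × max (capped at top-v = 20 per query), then discard "common" neurons that recur across at least u = 30% of queries. The paper does not publish this exact code; the following is a minimal illustrative sketch with hypothetical function names, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def select_coarse_neurons(attribution, threshold_ratio=0.2, top_v=20):
    """Illustrative sketch: keep neurons whose attribution score exceeds
    threshold_ratio * max score, capped at the top_v highest-scoring ones.
    (threshold_ratio=0.2 and top_v=20 follow the reported hyper-parameters.)"""
    t = threshold_ratio * attribution.max()
    candidates = np.flatnonzero(attribution > t)
    # rank surviving candidates by attribution, highest first, keep at most top_v
    ranked = candidates[np.argsort(attribution[candidates])[::-1]]
    return ranked[:top_v]

def filter_common_neurons(per_query_neurons, freq_u=0.3):
    """Illustrative sketch: drop neurons that appear in at least freq_u
    (30% per the paper) of all queries, treating them as 'common' neurons."""
    counts = Counter(n for ns in per_query_neurons for n in set(ns))
    n_queries = len(per_query_neurons)
    common = {n for n, c in counts.items() if c / n_queries >= freq_u}
    return [[n for n in ns if n not in common] for ns in per_query_neurons]

# Stand-in attribution scores for one query (real scores would come from
# the paper's gradient-based attribution over FFN neurons).
rng = np.random.default_rng(0)
scores = rng.random(1000)
neurons = select_coarse_neurons(scores)
```

This sketch assumes per-neuron attribution scores are already computed (the paper obtains them via m = 16 estimation steps of a gradient-based attribution method); only the thresholding and common-neuron filtering are shown here.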