KBLaM: Knowledge Base augmented Language Model

Authors: Xi Wang, Taketomo Isazawa, Liana Mikaelyan, James Hensman

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments demonstrate KBLaM's effectiveness in various tasks, including question-answering and open-ended reasoning, while providing interpretable insights into its use of the augmented knowledge." "In this section, we perform empirical evaluation for KBLaM."
Researcher Affiliation | Collaboration | Xi Wang (Johns Hopkins University, EMAIL); Taketomo Isazawa* (Microsoft Research, EMAIL); Liana Mikaelyan (Microsoft, EMAIL); James Hensman (Microsoft Research, EMAIL)
Pseudocode | No | The paper describes methods and processes verbally and with diagrams (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | "Code and datasets are available at https://github.com/microsoft/KBLaM/"
Open Datasets | Yes | "Code and datasets are available at https://github.com/microsoft/KBLaM/" "Lastly, we release our training and evaluation KBs, which can help future research in augmenting LLM with KB, as well as in other topics such as long-context language models, hallucination detection/reduction, and structured attention." "Enron: A KB constructed from the Enron (Klimt & Yang, 2004) dataset, an open-sourced corporate email dataset."
Dataset Splits | Yes | "To construct each training sample, we perform the following procedure: We randomly select a subset of 10 to 100 triples from the synthetic KB to form a sample-specific KB." "For evaluation, we considered the following two KB datasets" "Synthetic data: The validation set of the synthetic KB, i.e. the 15000 triples not used for training." "We consider a setting where, given a KB, we ask the model 100 questions in total, out of which 80 questions are answerable, and the other 20 are not."
Hardware Specification | Yes | "The instruction tuning is performed on a single 80GB A100 GPU under bfloat16 without any parameter-efficient tuning methods."
Software Dependencies | No | "For all experiments, we use the instruction fine-tuned version of Llama3 8B (Dubey et al., 2024) as the backbone LLM, and OpenAI's ada-002 sentence embedding model (P = 1536) as the pre-trained encoder for computing base key and value embeddings (Eq. (5))." The paper names the specific models used (Llama3 8B, OpenAI's ada-002) but does not provide version numbers for the software libraries or programming languages required for replication.
Experiment Setup | Yes | "Optimization is conducted using AdamW (Loshchilov, 2017) with a step size of 5 × 10⁻⁴ and a cosine learning rate decay to 5 × 10⁻⁶ for 20K iterations. Each iteration uses a mini-batch of 400 Q&A pairs, composed of 20 micro-batches of 20 samples."
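The quoted optimizer configuration pins down the learning-rate schedule completely. As a minimal sketch, the cosine decay from 5 × 10⁻⁴ to 5 × 10⁻⁶ over 20K iterations could look like the following; the function name and interface are illustrative, not from the paper.

```python
import math

def cosine_lr(step, total_steps=20_000, lr_max=5e-4, lr_min=5e-6):
    """Cosine learning-rate decay: lr_max at step 0, lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Each iteration processes 400 Q&A pairs as 20 micro-batches of 20 samples
# (i.e. gradient accumulation), matching the quoted setup.
MICRO_BATCHES, MICRO_BATCH_SIZE = 20, 20
EFFECTIVE_BATCH = MICRO_BATCHES * MICRO_BATCH_SIZE  # 400
```

The schedule values (5e-4, 5e-6, 20K iterations) and batch composition are taken directly from the quote above; everything else is a sketch.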
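The Dataset Splits row describes a concrete sampling procedure: each training sample draws a sample-specific KB of 10 to 100 triples, and each evaluation run asks 100 questions of which 80 are answerable. A minimal sketch of that procedure, with all names and the triple format assumed for illustration:

```python
import random

def build_training_sample(kb_triples, rng, min_size=10, max_size=100):
    """Randomly select a subset of 10 to 100 triples as a sample-specific KB."""
    size = rng.randint(min_size, max_size)
    return rng.sample(kb_triples, size)

def build_eval_questions(rng, total=100, answerable=80):
    """Label the evaluation questions: 80 answerable, 20 unanswerable."""
    labels = ["answerable"] * answerable + ["unanswerable"] * (total - answerable)
    rng.shuffle(labels)
    return labels

rng = random.Random(0)
# Hypothetical synthetic KB of (name, property, value) triples.
kb = [(f"entity_{i}", "description", f"value_{i}") for i in range(15_000)]
sample_kb = build_training_sample(kb, rng)
labels = build_eval_questions(rng)
```

This only mirrors the split sizes stated in the quote; the paper's actual triple contents and question generation are not reproduced here.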