LLMs can learn self-restraint through iterative self-reflection

Authors: Alexandre Piché, Aristides Milios, Dzmitry Bahdanau, Christopher Pal

TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our models on two tasks: i) Biographies and ii) Historical events. For each task, we extract a dataset from Wikipedia where each entity (person or historical event) is classified into a popularity tier (bottom, middle, top) according to the length of the corresponding article. Evaluation is done through atomic claim decomposition and evaluation with a larger LLM and a retriever that has access to the article. While our datasets and methodology are similar to FActScore (Min et al., 2023), our datasets are larger (over 8000 entities per dataset) and significantly harder, since they span the whole distribution of Wikipedia pages, while FActScore is smaller (183 labeled entities and 500 unlabeled entities) and mostly focuses on entities that are referred to by other Wikipedia pages.
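The evaluation described above (decompose a response into atomic claims, judge each against the retrieved article, report the supported fraction) can be sketched as follows. This is a minimal illustration, not the authors' code: `split_into_claims` and `is_supported` are hypothetical stand-ins for the LLM-based claim decomposer and judge the paper describes.

```python
# Sketch of claim-level factuality scoring: the score is the fraction of
# atomic claims in a generated response that the judge deems supported
# by the reference article.
from typing import Callable, List

def factuality_score(
    response: str,
    article: str,
    split_into_claims: Callable[[str], List[str]],   # stand-in for the LLM decomposer
    is_supported: Callable[[str, str], bool],        # stand-in for the LLM judge
) -> float:
    """Fraction of atomic claims in `response` supported by `article`."""
    claims = split_into_claims(response)
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, article) for claim in claims)
    return supported / len(claims)
```

With a toy splitter (sentence-level claims) and a substring judge, a response with two of three claims present in the article scores 2/3.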
Researcher Affiliation Collaboration Alexandre Piché, EMAIL, ServiceNow Research; Aristides Milios, Mila, Université de Montréal; Dzmitry Bahdanau, ServiceNow Research, Mila, McGill University, Canada CIFAR AI Chair; Chris Pal, ServiceNow Research, Mila, Polytechnique Montréal, Canada CIFAR AI Chair
Pseudocode Yes Algorithm 1 ReSearch algorithm. Require: Context dataset {x_i}_{i=1}^N. Require: Policy π_θ : X → P(Y). Require: Q-value Q(x, y) ∈ ℝ. Require: Claim likelihood p(T | x, c, Y) ∈ [0, 1]. Require: Factuality threshold ρ ∈ [0, 1]. Require: Claim splitter CS(y).
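The Require lines above only list the quantities Algorithm 1 consumes, so the following is a loose Python sketch of how they fit together, not the paper's procedure: the selection rule (keep a sampled response only if every atomic claim clears the threshold ρ, otherwise abstain) is an assumption for illustration.

```python
# Loose sketch around Algorithm 1's ingredients: a policy that samples
# candidate responses, a claim splitter CS(y), a claim likelihood
# p(T | x, c) in [0, 1], and a factuality threshold rho in [0, 1].
from typing import Callable, List, Optional

def select_factual_response(
    x: str,
    policy: Callable[[str], List[str]],             # samples candidate responses y
    claim_splitter: Callable[[str], List[str]],     # CS(y): atomic claims
    claim_likelihood: Callable[[str, str], float],  # p(T | x, c), assumed interface
    rho: float,                                     # factuality threshold
) -> Optional[str]:
    """Return the first sample whose claims all clear rho; else abstain."""
    for y in policy(x):
        claims = claim_splitter(y)
        if claims and all(claim_likelihood(x, c) >= rho for c in claims):
            return y
    return None  # abstention: no sample was confidently factual
```

Abstention falls out naturally: if no candidate clears the bar, the function returns `None` instead of a response.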
Open Source Code No The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets Yes Tasks. We evaluate our models on two tasks: i) Biographies and ii) Historical events. For each task, we extract a dataset from Wikipedia where each entity (person or historical event) is classified into a popularity tier (bottom, middle, top) according to the length of the corresponding article. ... While our datasets and methodology are similar to FActScore (Min et al., 2023)...
Dataset Splits Yes Experimental protocol. Each dataset is divided into a 7k train, 400 validation, and 800 test set, plus 30 invented entities to evaluate the model's ability to abstain.
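The reported split is a fixed 7,000 / 400 / 800 partition of the real entities, with the 30 invented entities constructed separately to probe abstention. A minimal sketch (the seeded shuffle is an assumption; the paper does not state how entities were assigned to splits):

```python
# Sketch of the 7k train / 400 validation / 800 test split per dataset.
# The invented entities used for the abstention probe are built
# separately and are not carved out of the real data here.
import random

def split_entities(entities, seed=0):
    """Partition a list of at least 8,200 entities into train/valid/test."""
    assert len(entities) >= 7000 + 400 + 800
    rng = random.Random(seed)  # assumed: a seeded shuffle for reproducibility
    shuffled = entities[:]
    rng.shuffle(shuffled)
    return {
        "train": shuffled[:7000],
        "valid": shuffled[7000:7400],
        "test": shuffled[7400:8200],
    }
```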
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It mentions using "Llama2 70B" for evaluation, but this refers to a model, not the underlying hardware.
Software Dependencies No The paper mentions using "GTR-large (Ni et al., 2021)" and the "spaCy" library for NER, but does not specify version numbers for these or any other key software components.
Experiment Setup Yes Hyperparameters: The hyper-parameters for the training are available in Table 4.

Table 4: Hyper-parameters
Model | Method | gradient steps | learning rate | batch size | regularization
Llama2 | SFT ReSearch (Ours) | 2000 | 5e-6 | 16 | n/a
Llama2 | RLOO ReSearch (Ours) | 600 | 1e-6 | 512 | 0.1
Llama2 | DPO ReSearch (Ours) | 400 | 1e-6 | 64 | 0.25
Llama2 | search SFT | 2000 | 5e-6 | 16 | n/a
Llama2 | search RLOO | 600 | 1e-6 | 512 | 0.1
Llama2 | search DPO | 400 | 1e-6 | 64 | 0.25
Mistral 7b | SFT ReSearch (Ours) | 2000 | 5e-6 | 8 | n/a
Mistral 7b | RLOO ReSearch (Ours) | 600 | 1e-6 | 512 | 0.1
Mistral 7b | DPO ReSearch (Ours) | 400 | 1e-6 | 64 | 0.25
Mistral 7b | search SFT | 2000 | 5e-6 | 8 | n/a
Mistral 7b | search RLOO | 600 | 1e-6 | 512 | 0.1
Mistral 7b | search DPO | 400 | 1e-6 | 64 | 0.25
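Since Table 4 reports identical settings for the ReSearch and search variants of each method, the hyperparameters can be transcribed per (model, method) pair. A hedged transcription into a plain config dict (the key and field names are our own, not from the paper; "reg" is the regularization coefficient, absent for SFT):

```python
# Table 4 hyperparameters keyed by (model, method); the same values
# apply to both the ReSearch and search variants of each method.
HPARAMS = {
    ("Llama2", "SFT"):      {"steps": 2000, "lr": 5e-6, "batch": 16},
    ("Llama2", "RLOO"):     {"steps": 600,  "lr": 1e-6, "batch": 512, "reg": 0.1},
    ("Llama2", "DPO"):      {"steps": 400,  "lr": 1e-6, "batch": 64,  "reg": 0.25},
    ("Mistral 7b", "SFT"):  {"steps": 2000, "lr": 5e-6, "batch": 8},
    ("Mistral 7b", "RLOO"): {"steps": 600,  "lr": 1e-6, "batch": 512, "reg": 0.1},
    ("Mistral 7b", "DPO"):  {"steps": 400,  "lr": 1e-6, "batch": 64,  "reg": 0.25},
}
```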