LLMs can learn self-restraint through iterative self-reflection

Authors: Alexandre Piché, Aristides Milios, Dzmitry Bahdanau, Christopher Pal

TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our models on two tasks: i) Biographies and ii) Historical events. For each task, we extract a dataset from Wikipedia where each entity (person or historical event) is classified into a popularity tier (bottom, middle, top) according to the length of the corresponding article. Evaluation is done through atomic claim decomposition and evaluation with a larger LLM and a retriever that has access to the article. While our datasets and methodology are similar to FActScore (Min et al., 2023), our datasets are larger (over 8000 entities per dataset) and significantly harder, since they span the whole distribution of Wikipedia pages, while FActScore is smaller (183 labeled entities and 500 unlabeled entities) and mostly focuses on entities that are referred to by other Wikipedia pages.
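The evaluation described above (decompose a response into atomic claims, judge each against the retrieved article, report the supported fraction) can be sketched as follows. This is a minimal illustration, not the authors' code: `split_into_claims` and `is_supported` are hypothetical stand-ins for the LLM-based claim decomposer and judge the paper describes.

```python
# Sketch of claim-level factuality scoring: the score is the fraction of
# atomic claims in a generated response that the judge deems supported
# by the reference article.
from typing import Callable, List

def factuality_score(
    response: str,
    article: str,
    split_into_claims: Callable[[str], List[str]],   # stand-in for the LLM decomposer
    is_supported: Callable[[str, str], bool],        # stand-in for the LLM judge
) -> float:
    """Fraction of atomic claims in `response` supported by `article`."""
    claims = split_into_claims(response)
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, article) for claim in claims)
    return supported / len(claims)
```

With a toy splitter (sentence-level claims) and a substring judge, a response with two of three claims present in the article scores 2/3.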
Researcher Affiliation Collaboration Alexandre Piché, EMAIL, ServiceNow Research; Aristides Milios, Mila, Université de Montréal; Dzmitry Bahdanau, ServiceNow Research, Mila, McGill University, Canada CIFAR AI Chair; Chris Pal, ServiceNow Research, Mila, Polytechnique Montréal, Canada CIFAR AI Chair
Pseudocode Yes Algorithm 1 ReSearch algorithm. Require: Context dataset {x_i}_{i=1}^N. Require: Policy π_θ : X → P(Y). Require: Q-value Q(x, y) ∈ ℝ. Require: Claim likelihood p(T | x, c, Y) ∈ [0, 1]. Require: Factuality threshold ρ ∈ [0, 1]. Require: Claim splitter CS(y).
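The Require lines above only list the quantities Algorithm 1 consumes, so the following is a loose Python sketch of how they fit together, not the paper's procedure: the selection rule (keep a sampled response only if every atomic claim clears the threshold ρ, otherwise abstain) is an assumption for illustration.

```python
# Loose sketch around Algorithm 1's ingredients: a policy that samples
# candidate responses, a claim splitter CS(y), a claim likelihood
# p(T | x, c) in [0, 1], and a factuality threshold rho in [0, 1].
from typing import Callable, List, Optional

def select_factual_response(
    x: str,
    policy: Callable[[str], List[str]],             # samples candidate responses y
    claim_splitter: Callable[[str], List[str]],     # CS(y): atomic claims
    claim_likelihood: Callable[[str, str], float],  # p(T | x, c), assumed interface
    rho: float,                                     # factuality threshold
) -> Optional[str]:
    """Return the first sample whose claims all clear rho; else abstain."""
    for y in policy(x):
        claims = claim_splitter(y)
        if claims and all(claim_likelihood(x, c) >= rho for c in claims):
            return y
    return None  # abstention: no sample was confidently factual
```

Abstention falls out naturally: if no candidate clears the bar, the function returns `None` instead of a response.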
Open Source Code No The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets Yes Tasks. We evaluate our models on two tasks: i) Biographies and ii) Historical events. For each task, we extract a dataset from Wikipedia where each entity (person or historical event) is classified into a popularity tier (bottom, middle, top) according to the length of the corresponding article. ... While our datasets and methodology are similar to FActScore (Min et al., 2023)...
Dataset Splits Yes Experimental protocol. Each dataset is divided into a 7k train, 400 validation, and 800 test set, plus 30 invented entities to evaluate the model's ability to abstain.
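The reported split is a fixed 7,000 / 400 / 800 partition of the real entities, with the 30 invented entities constructed separately to probe abstention. A minimal sketch (the seeded shuffle is an assumption; the paper does not state how entities were assigned to splits):

```python
# Sketch of the 7k train / 400 validation / 800 test split per dataset.
# The invented entities used for the abstention probe are built
# separately and are not carved out of the real data here.
import random

def split_entities(entities, seed=0):
    """Partition a list of at least 8,200 entities into train/valid/test."""
    assert len(entities) >= 7000 + 400 + 800
    rng = random.Random(seed)  # assumed: a seeded shuffle for reproducibility
    shuffled = entities[:]
    rng.shuffle(shuffled)
    return {
        "train": shuffled[:7000],
        "valid": shuffled[7000:7400],
        "test": shuffled[7400:8200],
    }
```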
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It mentions using "Llama2 70B" for evaluation, but this refers to a model, not the underlying hardware.
Software Dependencies No The paper mentions using "GTR-large (Ni et al., 2021)" and the "spaCy" library for NER, but does not specify version numbers for these or any other key software components.
Experiment Setup Yes Hyperparameters: The hyper-parameters for the training are available in Table 4.

Table 4: Hyper-parameters
Model | Method | gradient steps | learning rate | batch size | regularization
Llama2 | SFT ReSearch (Ours) | 2000 | 5e-6 | 16 | n/a
Llama2 | RLOO ReSearch (Ours) | 600 | 1e-6 | 512 | 0.1
Llama2 | DPO ReSearch (Ours) | 400 | 1e-6 | 64 | 0.25
Llama2 | search SFT | 2000 | 5e-6 | 16 | n/a
Llama2 | search RLOO | 600 | 1e-6 | 512 | 0.1
Llama2 | search DPO | 400 | 1e-6 | 64 | 0.25
Mistral 7b | SFT ReSearch (Ours) | 2000 | 5e-6 | 8 | n/a
Mistral 7b | RLOO ReSearch (Ours) | 600 | 1e-6 | 512 | 0.1
Mistral 7b | DPO ReSearch (Ours) | 400 | 1e-6 | 64 | 0.25
Mistral 7b | search SFT | 2000 | 5e-6 | 8 | n/a
Mistral 7b | search RLOO | 600 | 1e-6 | 512 | 0.1
Mistral 7b | search DPO | 400 | 1e-6 | 64 | 0.25
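Since Table 4 reports identical settings for the ReSearch and search variants of each method, the hyperparameters can be transcribed per (model, method) pair. A hedged transcription into a plain config dict (the key and field names are our own, not from the paper; "reg" is the regularization coefficient, absent for SFT):

```python
# Table 4 hyperparameters keyed by (model, method); the same values
# apply to both the ReSearch and search variants of each method.
HPARAMS = {
    ("Llama2", "SFT"):      {"steps": 2000, "lr": 5e-6, "batch": 16},
    ("Llama2", "RLOO"):     {"steps": 600,  "lr": 1e-6, "batch": 512, "reg": 0.1},
    ("Llama2", "DPO"):      {"steps": 400,  "lr": 1e-6, "batch": 64,  "reg": 0.25},
    ("Mistral 7b", "SFT"):  {"steps": 2000, "lr": 5e-6, "batch": 8},
    ("Mistral 7b", "RLOO"): {"steps": 600,  "lr": 1e-6, "batch": 512, "reg": 0.1},
    ("Mistral 7b", "DPO"):  {"steps": 400,  "lr": 1e-6, "batch": 64,  "reg": 0.25},
}
```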