SPRI: Aligning Large Language Models with Context-Situated Principles
Authors: Hongli Zhan, Muneeza Azmat, Raya Horesh, Junyi Jessy Li, Mikhail Yurochkin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that lead to on-par performance with expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We evaluate SPRI in three situations: (1) a domain-specific task where expert-level complex principles were shown to be necessary: having LLMs produce cognitive reappraisals (§4.1); (2) evaluation of open-ended generations across complex tasks with LLM judges; (3) generating synthetic data with SPRI for fine-tuning base LLMs, which results in substantial improvement on TruthfulQA (Lin et al., 2022) whilst maintaining performance on other benchmarks (§5). |
| Researcher Affiliation | Collaboration | 1Department of Linguistics, The University of Texas at Austin, Austin, TX, USA 2IBM Research, Yorktown Heights, NY, USA 3MIT-IBM Watson AI Lab, Cambridge, MA, USA. |
| Pseudocode | Yes | Appendix A, Algorithm 1: Pseudo-code for SPRI. Require: user input T, base language model M, critic language model C, seed examples S (optional), prompts {P_principle-gen, P_principle-refine, P_response-gen, P_response-refine}, evaluation prompts {Eval_principle, Eval_response}, max iterations n_max, desired score threshold τ. |
| Open Source Code | Yes | We release our code and model generations at https://github.com/honglizhan/SPRI-public. |
| Open Datasets | Yes | We evaluate on the same dataset from Zhan et al. (2024). The data is sourced from Reddit posts seeking emotional support... We utilize BiGGen Bench (Kim et al., 2025), an extensive benchmark... We evaluate the performance of fine-tuned models on several benchmarks, namely TruthfulQA (Lin et al., 2022), MUSR (Sprague et al., 2024), GPQA (Rein et al., 2024), BBH (Suzgun et al., 2023), MMLU-Pro (Wang et al., 2024), and HellaSwag (Zellers et al., 2019). |
| Dataset Splits | Yes | We randomly split Dolly into a 10k/2k split for training and validation. For MixInstruct, we randomly select 50k examples from its training set and 2k examples from its validation set. |
| Hardware Specification | Yes | All our fine-tuning experiments are carried out on 3 NVIDIA A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions fine-tuning with LoRA (Hu et al., 2022) and using the Alpaca format template (Taori et al., 2023), but it does not specify version numbers for programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We set the temperature to 0.7, top-k to 50, and top-p to 0.95 for all model generations, and restrict the maximum generation length to 256 tokens. The iterative critique-refinement process continues until the principles receive a desired score of at least 4 or a maximum of four iterations is reached. We fine-tune with LoRA (Hu et al., 2022) and compute the loss on responses only. For base (i.e., non-instruction-tuned) models, we use the Alpaca format template (Taori et al., 2023) for training; for instruction-tuned models, we fine-tune on their own chat templates. We save the checkpoint with the best validation loss as the final model. |
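The critique-refinement loop described in the Pseudocode and Experiment Setup rows (critic score of at least 4, at most four iterations) can be sketched as below. This is a minimal illustration of the control flow only, not the authors' released code: the function and callback names are hypothetical stand-ins, and the model calls are passed in as plain functions.

```python
# Minimal sketch of SPRI's two-stage critique-refine loop (Algorithm 1).
# All names and the callback interface are hypothetical; only the control
# flow (refine until score >= threshold or iterations run out) follows the paper.

def refine_until_good(draft, evaluate, refine, threshold=4, max_iters=4):
    """Critique `draft` and refine it until the critic's score reaches
    `threshold` or `max_iters` rounds are spent."""
    for _ in range(max_iters):
        score, feedback = evaluate(draft)
        if score >= threshold:
            break
        draft = refine(draft, feedback)
    return draft

def spri(user_input, gen_principles, eval_principles, refine_principles,
         gen_response, eval_response, refine_response):
    # Stage 1: derive context-situated principles for this input,
    # refining them with critic feedback.
    principles = refine_until_good(
        gen_principles(user_input),
        evaluate=lambda p: eval_principles(user_input, p),
        refine=lambda p, fb: refine_principles(user_input, p, fb),
    )
    # Stage 2: generate a response guided by those principles,
    # refined the same way.
    response = refine_until_good(
        gen_response(user_input, principles),
        evaluate=lambda r: eval_response(user_input, principles, r),
        refine=lambda r, fb: refine_response(user_input, principles, r, fb),
    )
    return principles, response
```

In the paper, the four prompts {P_principle-gen, P_principle-refine, P_response-gen, P_response-refine} and the two evaluation prompts would back these callbacks via the base model M and the critic model C.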
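The decoding settings quoted in the Experiment Setup row can be collected in one place. This is a plain-dict sketch with key names of our choosing; the paper reports only the values, not a config file.

```python
# Decoding and refinement hyperparameters reported in the paper, gathered
# into illustrative dicts. Key names are our own, not the authors' config.
SPRI_DECODING = {
    "temperature": 0.7,    # sampling temperature for all model generations
    "top_k": 50,           # keep only the 50 most likely next tokens
    "top_p": 0.95,         # nucleus sampling probability mass
    "max_new_tokens": 256, # cap on generated length
}

SPRI_REFINEMENT = {
    "score_threshold": 4,  # critic score needed to stop refining
    "max_iterations": 4,   # hard cap on critique-refine rounds
}
```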