Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Authors: Mohamad Fares El Hajj Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-Safe RLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness."
Researcher Affiliation | Academia | (1) University of Texas at Austin, Austin, TX, USA; (2) University of Maryland, College Park, MD, USA; (3) University of Central Florida, Orlando, FL, USA.
Pseudocode | Yes | "Algorithm 1 SITAlign: Satisficing Inference-Time Alignment of Large Language Models"
Open Source Code | No | The paper states: "Reproducibility is ensured through the use of publicly available resources." However, it provides no specific link and no explicit statement that the authors' own source code is open-sourced or included in supplementary materials.
Open Datasets | Yes | The paper uses the PKU-Safe RLHF dataset (Ji et al., 2024), the Anthropic-HH dataset (Bai et al., 2022b), and the Summarize-from-Feedback dataset (Stiennon et al., 2020b).
Dataset Splits | No | The paper mentions using "the test set of PKU-Safe RLHF dataset" but does not specify split percentages or sample counts for the training, validation, or test sets of any dataset used.
Hardware Specification | No | The paper does not report hardware details such as GPU models, CPU types, or memory specifications used to run the experiments.
Software Dependencies | No | The paper mentions using GPT-4 for evaluation and refers to models such as Zephyr-7B-β, Skywork-Reward-Llama-3.1-8B-v0.2, MPT-7B-Chat, and Minotaur-7B, but it does not specify version numbers for any ancillary software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "For implementation, we set the number of tokens sampled (top-k) k = 10 and the decoding alignment parameter α = 1."
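The setup row above reports k = 10 and α = 1 for a decoding rule that maximizes helpfulness subject to a harmlessness threshold. The following is a rough, hypothetical sketch of one such satisficing decoding step, not the paper's actual Algorithm 1: the function name, the candidate tuple layout, the fallback rule, and all reward values are assumptions made for illustration.

```python
# Illustrative sketch of a satisficing decoding step (NOT the paper's exact
# SITAlign algorithm): among the top-k candidate tokens, keep only those whose
# harmlessness reward meets a threshold, then pick the candidate maximizing
# base log-probability + alpha * helpfulness reward. The paper reports k = 10
# and alpha = 1; the candidate tuples and reward values below are made up.

def satisficing_step(candidates, alpha=1.0, harm_threshold=0.0):
    """Select one token from the top-k candidates.

    candidates: list of (token, base_logprob, helpfulness, harmlessness).
    If no candidate satisfies the harmlessness constraint, fall back to the
    least harmful candidate (an assumed fallback, chosen for illustration).
    """
    feasible = [c for c in candidates if c[3] >= harm_threshold]
    if not feasible:
        return max(candidates, key=lambda c: c[3])[0]
    return max(feasible, key=lambda c: c[1] + alpha * c[2])[0]

# Toy example: "b" has the highest combined helpfulness score but violates
# the harmlessness threshold, so the constrained rule selects "a" instead.
cands = [("a", -1.0, 0.9, 0.5), ("b", -0.2, 0.8, -0.3), ("c", -2.0, 1.5, 0.1)]
print(satisficing_step(cands))                       # constrained choice
print(satisficing_step(cands, harm_threshold=-1.0))  # constraint relaxed
```

In a real decoder this selection would run once per generated token over the model's top-k logits, with the two rewards coming from separate reward models; here plain tuples stand in for that machinery.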