LLM Alignment as Retriever Optimization: An Information Retrieval Perspective
Authors: Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O Arik
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate LARPO's effectiveness with 38.9% and 13.7% averaged improvement on AlpacaEval 2 and MixEval-Hard, respectively. |
| Researcher Affiliation | Collaboration | 1 University of Illinois at Urbana-Champaign; 2 Google Cloud AI Research; 3 Google DeepMind; 4 University of Virginia. |
| Pseudocode | Yes | Algorithm 1 LARPO: LLM alignment as iterative retriever preference optimization. |
| Open Source Code | No | The paper does not provide a specific repository link or an explicit statement about releasing the source code for the LARPO methodology. It only mentions 'The Alignment Handbook' as a general resource and refers to baseline checkpoints from other works. |
| Open Datasets | Yes | We conduct evaluation on two widely used benchmarks, AlpacaEval 2 (Dubois et al., 2024) and MixEval (Ni et al., 2024). ... an experiment on the GSM8K dataset (Cobbe et al., 2021) ... an experiment on the NQ dataset (Kwiatkowski et al., 2019) ... trained on the UltraFeedback dataset (Cui et al., 2024) ... trained on the MetaMath dataset (Yu et al., 2023). |
| Dataset Splits | Yes | The Mathstral-7b-it model is trained on the GSM8K training set and evaluated on the GSM8K test set. ... For DPO, we use the prompts in the training sets of the two datasets and conduct online iterative preference optimization with a binary rule-based reward (measuring whether the final answer is correct via string match). Evaluation is performed on the test sets of MATH and GSM8K, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, or cloud computing instance types) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or their version numbers (e.g., Python, PyTorch, or CUDA versions) used for the experiments. |
| Experiment Setup | Yes | The learning rate is set to 5e-7 and we train the LLM for 2 epochs per iteration. For the pairwise objective, we generate 2 responses for each prompt... The generation temperature is set to 1.0 and 0.8 for Mistral-7b-base and Mistral-7b-it, respectively (searched among 0.8, 0.9, 1.0, 1.1, 1.2). |
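The quoted experiment setup can be collected into a configuration sketch. This is a minimal, hypothetical rendering (the paper releases no code): the class and field names are illustrative, while the numeric values come directly from the quoted setup above.

```python
# Hypothetical config sketch for the reported LARPO training setup.
# Values are from the paper's quoted experiment setup; all names are
# illustrative assumptions, not from any released implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class LarpoTrainConfig:
    learning_rate: float = 5e-7        # "learning rate is set to 5e-7"
    epochs_per_iteration: int = 2      # "2 epochs per iteration"
    responses_per_prompt: int = 2      # pairwise objective: 2 responses per prompt
    # Grid the temperature was searched over, per the quoted setup.
    temperature_grid: List[float] = field(
        default_factory=lambda: [0.8, 0.9, 1.0, 1.1, 1.2]
    )
    temperature: float = 1.0           # 1.0 for Mistral-7b-base, 0.8 for Mistral-7b-it


# Per-model configs as reported.
base_cfg = LarpoTrainConfig(temperature=1.0)   # Mistral-7b-base
it_cfg = LarpoTrainConfig(temperature=0.8)     # Mistral-7b-it

# Both selected temperatures lie inside the reported search grid.
assert base_cfg.temperature in base_cfg.temperature_grid
assert it_cfg.temperature in it_cfg.temperature_grid
```

Grouping the hyperparameters this way makes the reproducibility gap concrete: the learning rate, epoch count, sampling budget, and temperature search are recoverable from the text, while hardware and software versions (marked "No" above) are not.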