reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PiCO: Peer Review in LLMs based on Consistency Optimization

Authors: Kun-Peng Ning, Shuo Yang, Yuyang Liu, Jia-Yu Yao, Zhenhui Liu, Yonghong Tian, Yibing Song, Yuan Li

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform experiments on multiple datasets with standard rank-based metrics, validating the effectiveness of the proposed approach. We perform experiments on multiple crowdsourcing datasets with standard rank-based metrics; the results demonstrate that the proposed PiCO framework can effectively obtain a large language models leaderboard closer to human preferences.
Researcher Affiliation	Academia	1 School of Electrical and Computer Engineering, Peking University; 2 Peng Cheng Laboratory; EMAIL, EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Overall Framework Algorithm of Peer Review
Open Source Code	Yes	Our code is released at https://github.com/PKU-YuanGroup/PiCO.
Open Datasets	Yes	To validate the effectiveness of the proposed approach, we perform experiments on Chatbot Arena (Zheng et al., 2023), MT-Bench (Zheng et al., 2023), and AlpacaEval (Li et al., 2023b).
Dataset Splits	Yes	The ratios of response sets D are 1, 0.7, and 0.4, respectively.
Hardware Specification	No	The paper does not provide specific hardware details used for running its experiments.
Software Dependencies	No	The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup	Yes	Initialize model weights vector w with Gaussian distribution. In the framework of the Elo mechanism, as specified by Equation 16, the BASE value is set to 10, and the SCALE factor is determined to be 400. In the context of the Rank mechanism, as outlined by Equation 17, rank(j) signifies the current ranking of model j, with the constant K assigned a value of 200. k is a hyper-parameter recommended to be set to 3 to 7, and we set k = 3 in this paper. It iteratively removes the lowest-scoring LLM from the reviewer queue for the next consistency optimization stage, until 60% of models are eliminated.
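
The setup values quoted above can be illustrated with a minimal Python sketch. It is not the authors' implementation: Equations 16 and 17 are not reproduced on this page, so `elo_expected` assumes the standard Elo expected-score formula with the reported BASE and SCALE values, and the Rank mechanism (K = 200) is left out. The names `elo_expected`, `eliminate_reviewers`, and `score_fn` are hypothetical, and rounding the 60% target down to a whole number of models is also an assumption.

```python
import math

BASE = 10.0    # Elo BASE value reported in the experiment setup
SCALE = 400.0  # Elo SCALE factor reported in the experiment setup


def elo_expected(rating_a, rating_b, base=BASE, scale=SCALE):
    """Standard Elo expected score of A against B (assumed form of Equation 16)."""
    return 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))


def eliminate_reviewers(models, score_fn, eliminate_ratio=0.6):
    """Remove the lowest-scoring reviewer one at a time until the target share is gone.

    score_fn is a hypothetical callback standing in for one consistency
    optimization stage: it maps the current reviewer list to {model: score}.
    """
    # Assumption: the number of models to eliminate is rounded down.
    target = math.floor(len(models) * eliminate_ratio + 1e-9)
    reviewers = list(models)
    removed = []
    for _ in range(target):
        scores = score_fn(reviewers)  # re-score the remaining reviewers
        worst = min(reviewers, key=lambda m: scores[m])
        reviewers.remove(worst)
        removed.append(worst)
    return reviewers, removed
```

With equal ratings `elo_expected` returns 0.5, and a 400-point advantage gives roughly 0.91. In a real run, `score_fn` would re-run the consistency optimization over the remaining reviewers instead of returning fixed scores.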