LLM Agents Can Be Choice-Supportive Biased Evaluators: An Empirical Study

Authors: Nan Zhuang, Boyu Cao, Yi Yang, Jing Xu, Mingda Xu, Yuxiao Wang, Qi Liu

AAAI 2025

Reproducibility Assessment: Variable | Result | LLM Response
Research Type: Experimental. We conduct experiments across 19 open- and closed-source LLMs in up to five scenarios, employing both memory-based and evaluation-based tasks adapted and redesigned from human cognitive studies. Our findings show that LLM agents may exhibit biased attribution or evaluation supporting their initial choices, and such bias may persist even when contextual hallucination is not observable. Our extensive study involving 284 well-educated human participants shows that, despite this bias, certain LLM agents can still outperform humans on similar evaluation tasks.
Researcher Affiliation: Academia. 1) School of Software Technology, Zhejiang University, Hangzhou, China; 2) School of Future Technology, South China University of Technology, Guangzhou, China; 3) Faculty of Humanities and Arts, Macau University of Science and Technology, Macau.
Pseudocode: No. The paper describes the experimental procedures and methodologies in detail (e.g., in the 'Memory-based Experiments' and 'Evaluation-based Experiments' sections), but it does not present any structured pseudocode or algorithm blocks.
Open Source Code: No. The paper mentions using 'open/unopen-source LLM models' and refers to an 'Extended version & Appendix https://t.cn/A6mg7Xel', but it does not explicitly state that the authors' own implementation code is released as open source, nor does it provide a direct link to a code repository.
Open Datasets: No. The paper adapts experimental paradigms and options from cited works (e.g., 'Henkel and Mather 2007; Lind et al. 2017'), states 'We recruit 301 participants' for the human study, and notes 'We generate a diverse set of decision-making scenarios' for the LLM experiments. However, it does not provide concrete access information (such as a link, DOI, or repository) for the datasets or generated scenarios used, nor does it state that they are publicly available.
Dataset Splits: No. The paper describes randomization procedures for options and features within each experiment (e.g., 'the order of each option and its features are shuffled randomly' and 'randomizing the assignment of positive features, negative features, and neutral features to the two options in each iteration'). However, it does not specify traditional training/validation/test splits, as the LLMs are the subjects being evaluated rather than models trained by the authors on a fixed dataset.
Hardware Specification: No. The paper mentions conducting experiments across various LLM models and using a GPT-4o agent for analysis, but it does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used to run these experiments or analyses.
Software Dependencies: No. The paper refers to various 'LLM models' and a 'GPT-4o agent' for evaluation, but it does not list any specific software libraries or their version numbers needed to replicate the experimental setup.
Experiment Setup: Yes. For each round of the experiment, the order of each option and its features is shuffled randomly, and the temperature is set to zero to ensure reproducibility. We conducted 50 iterations of the experiment for each model. The decision-making agent is presented with two options, each described by a set of shuffled features: three positive, three negative, and three neutral. The analysis is conducted by a GPT-4o agent, which evaluates the degree of implicit support for the chosen option in the Evaluation agent's responses, using a scoring system ranging from -5 to 5, which we term the Tendency Score.
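The setup described above can be sketched as a simple experiment loop. This is a minimal illustration, not the authors' code: the feature pools and the first-option decision stub are hypothetical, and `tendency_score` stands in for the GPT-4o judge that would be prompted at temperature 0 in a real run.

```python
import random

# Hypothetical feature pools; the paper's actual scenario text is not released.
POSITIVE = ["durable", "affordable", "efficient", "quiet", "compact", "reliable"]
NEGATIVE = ["fragile", "expensive", "slow", "noisy", "bulky", "glitchy"]
NEUTRAL = ["blue", "imported", "rectangular", "matte", "branded", "boxed"]

def build_trial(rng: random.Random):
    """Split each 6-feature pool 3/3 between two options at random, then
    shuffle each option's 9 features and the option presentation order,
    mirroring the randomization procedure described in the paper."""
    opts = {"A": [], "B": []}
    for pool in (POSITIVE, NEGATIVE, NEUTRAL):
        picks = rng.sample(pool, 3)                          # 3 features to one option...
        opts["A"].extend(picks)
        opts["B"].extend(f for f in pool if f not in picks)  # ...the rest to the other
    for feats in opts.values():
        rng.shuffle(feats)                                   # shuffle feature order
    order = ["A", "B"]
    rng.shuffle(order)                                       # shuffle option order
    return [(name, opts[name]) for name in order]

def tendency_score(evaluation: str, chosen: str) -> int:
    """Placeholder for the GPT-4o judge: returns an integer in [-5, 5]
    measuring implicit support for the chosen option. Stubbed to 0 here;
    a real run would prompt GPT-4o (temperature 0) with a scoring rubric."""
    return 0

def run_experiment(n_iters: int = 50, seed: int = 0) -> float:
    """Run the paper's 50-iteration protocol and average the Tendency Scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iters):
        trial = build_trial(rng)
        # A real run would query the decision/evaluation agents here.
        chosen = trial[0][0]                   # stub: always pick the first-listed option
        evaluation = f"I prefer option {chosen}."
        scores.append(tendency_score(evaluation, chosen))
    return sum(scores) / len(scores)           # mean Tendency Score in [-5, 5]
```

A positive mean score would indicate that the evaluation agent's reasoning implicitly favors its earlier choice, which is the choice-supportive bias the paper measures.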