Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Human Simulacra: Benchmarking the Personification of Large Language Models

Authors: Qiujie Xie, Qiming Feng, Tianqi Zhang, Qingqiu Li, Linyi Yang, Yuejie Zhang, Rui Feng, Liang He, Shang Gao, Yue Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our constructed simulacra can produce personified responses that align with their target characters. Using the proposed benchmark, we empirically discuss the research question: How far are LLMs from replacing human subjects in psychological and sociological experiments? In this section, we introduce the empirical study involving 14 widely-used LLMs with 4 different auxiliary methods (None, Prompt, Retrieval Augmented Generation (RAG), and MACM) using 3 experimental settings (self reports, observer reports, and psychology experiment on conformity) on the Human Simulacra dataset.
Researcher Affiliation | Collaboration | (1) School of Computer Science, Shanghai Key Lab of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University; (2) Tongji University; (3) University College London; (4) Huawei Noah's Ark Lab; (5) Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention; (6) Deakin University; (7) Westlake University; (8) Westlake Institute for Advanced Study
Pseudocode | Yes | The complete process of dataset construction is outlined in Algorithm 1. Using this process, we created the virtual character dataset Human Simulacra, comprising about 129k texts across 11 virtual characters. In particular, we designed a virtual avatar for each character, as displayed in Figure 9.
Open Source Code | Yes | To ensure the reproducibility of our results, we have made detailed efforts throughout the paper. We provide comprehensive information about the dataset construction, virtual character design, and personality modeling in Section 3. Further details, including implementation specifics, simulation methods, and evaluation protocols, are available in Sections 3, 4 and the appendices. Our code and dataset are available at: https://github.com/hasakiXie123/Human-Simulacra.
Open Datasets | Yes | To ensure the reproducibility of our results, we have made detailed efforts throughout the paper. We provide comprehensive information about the dataset construction, virtual character design, and personality modeling in Section 3. Further details, including implementation specifics, simulation methods, and evaluation protocols, are available in Sections 3, 4 and the appendices. Our code and dataset are available at: https://github.com/hasakiXie123/Human-Simulacra.
Dataset Splits | No | The paper describes the creation of the Human Simulacra dataset (129k texts across 11 virtual characters) and its use in experiments (e.g., 18 trials for psychological replication). However, it does not specify traditional dataset splits like training/validation/test percentages or counts for model training.
Hardware Specification | Yes | Regarding hardware devices, all experiments in this paper are conducted on 8x3090 24GB GPUs.
Software Dependencies | No | The paper mentions using specific LLM models like 'GPT-3.5-Turbo model (Brown, 2020)' and 'GPT-4-Turbo' for data generation and simulation. However, it does not list any other software dependencies with specific version numbers (e.g., programming languages, libraries, or frameworks) required to replicate the experimental setup.
Experiment Setup | Yes | We employ the GPT-3.5-Turbo model (Brown, 2020) as the data generator with a frequency_penalty of 1.0 and top_p of 0.95. To evaluate the human simulation ability of different LLMs, we experiment with 14 mainstream LLM-based simulacra with 4 different auxiliary methods (None, Prompt, Retrieval Augmented Generation (RAG), and MACM) using 3 experimental settings (self reports, observer reports, and psychology experiment on conformity). Following (Asch, 1956; 2016), we arrange 18 trials for the simulacra in the psychological experiment replication.
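The sampling configuration reported for the data generator can be captured directly from the quoted setup. This is a minimal sketch: the parameter values (frequency_penalty 1.0, top_p 0.95, GPT-3.5-Turbo) come from the paper, while the dict wrapper and the suggested client call are our own illustrative conventions.

```python
# Sampling parameters for the data generator, as reported in the paper.
# Collecting them in a dict is an illustrative convention, not the
# authors' code.
GENERATOR_CONFIG = {
    "model": "gpt-3.5-turbo",   # GPT-3.5-Turbo, per the experiment setup
    "frequency_penalty": 1.0,
    "top_p": 0.95,
}

# With an openai>=1.0-style client (an assumption; the paper does not name
# its client library), a generation call would look roughly like:
#   client.chat.completions.create(messages=..., **GENERATOR_CONFIG)
print(GENERATOR_CONFIG)
```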
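The evaluation grid described above (14 LLMs, 4 auxiliary methods, 3 experimental settings) can be sketched as a Cartesian product. The model names below are an illustrative subset, not the paper's full list; only the method and setting names come from the quoted text.

```python
from itertools import product

# Illustrative subset of the 14 evaluated LLMs (names assumed here);
# the paper lists all 14.
llms = ["GPT-4-Turbo", "GPT-3.5-Turbo", "Llama-2-70B"]
methods = ["None", "Prompt", "RAG", "MACM"]          # per the paper
settings = ["self report", "observer report", "conformity experiment"]

# Each (LLM, method) pair forms one simulacrum, evaluated under each setting.
conditions = list(product(llms, methods, settings))
print(len(conditions))  # 3 * 4 * 3 = 36 for this subset; 14 * 4 * 3 = 168 overall
```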