CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-Tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, Yanghua Xiao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the value of the COSER dataset for RPLA training, evaluation and retrieval. Moreover, COSER 70B exhibits state-of-the-art performance, surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and Life Choice benchmarks respectively.
Researcher Affiliation | Collaboration | (1) Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University; (2) StepFun; (3) School of Data Science, Fudan University; (4) Johns Hopkins University.
Pseudocode | No | The paper describes methods like 'LLM-based pipeline' and 'given-circumstance acting' but does not include any explicit pseudocode blocks or algorithms with structured steps.
Open Source Code | Yes | Our code, dataset and models are available at: https://github.com/Neph0s/CoSER.
Open Datasets | Yes | The COSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. ... Our code, dataset and models are available at: https://github.com/Neph0s/CoSER.
Dataset Splits | Yes | We train COSER 8B and COSER 70B... using 90% of the books in our dataset. For evaluation purposes, we held out the last 10% of data from each book; that is, this data is not included in our prompts or datasets for training or retrieval purposes. Additionally, the COSER models were trained on only 90% of the books. The COSER Test set contains 200 samples: 100 from books used in COSER training (in-domain) and 100 from books not used in training (out-of-domain).
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU models, CPU types, memory) used to run its experiments or train its models, beyond naming the underlying models (LLaMA 3.1) and the LLM critics (GPT-4o).
Software Dependencies | Yes | In this paper, we employ Claude-3.5-Sonnet (20240620).
Experiment Setup | Yes | We fine-tune the LLaMA 3.1 models using the following hyperparameters: a learning rate of 1e-5, a sequence length of 16,384, training for 8 epochs, and a global batch size of 48.
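The fine-tuning setup quoted in the last row can be captured as a small configuration sketch. Only the values in the dictionary below are taken from the paper; the helper function and the example dataset size of 10,000 samples are hypothetical illustrations for estimating training length, not the authors' actual training code.

```python
# Hyperparameters quoted from the paper for fine-tuning LLaMA 3.1 into CoSER 8B/70B.
COSER_FINETUNE_CONFIG = {
    "base_models": ["LLaMA-3.1-8B", "LLaMA-3.1-70B"],
    "learning_rate": 1e-5,
    "max_sequence_length": 16_384,
    "num_epochs": 8,
    "global_batch_size": 48,
}

def steps_per_epoch(num_samples: int, global_batch_size: int) -> int:
    """Optimizer steps per epoch at a given global batch size (ceiling division)."""
    return -(-num_samples // global_batch_size)

# Hypothetical example: a training set of 10,000 conversation samples.
print(steps_per_epoch(10_000, COSER_FINETUNE_CONFIG["global_batch_size"]))  # prints 209
```

Note that the "global batch size of 48" refers to the effective batch across all devices; how it decomposes into per-device batch size and gradient accumulation is not specified in the paper.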