CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-Tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, Yanghua Xiao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the value of the COSER dataset for RPLA training, evaluation and retrieval. Moreover, COSER 70B exhibits state-of-the-art performance, surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and Life Choice benchmarks respectively.
Researcher Affiliation | Collaboration | (1) Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University; (2) StepFun; (3) School of Data Science, Fudan University; (4) Johns Hopkins University.
Pseudocode | No | The paper describes methods like 'LLM-based pipeline' and 'given-circumstance acting' but does not include any explicit pseudocode blocks or algorithms with structured steps.
Open Source Code | Yes | Our code, dataset and models are available at: https://github.com/Neph0s/CoSER.
Open Datasets | Yes | The COSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. ... Our code, dataset and models are available at: https://github.com/Neph0s/CoSER.
Dataset Splits | Yes | We train COSER 8B and COSER 70B... using 90% of the books in our dataset. For evaluation purposes, we held out the last 10% of data from each book; that is, this data is not included in our prompts or datasets for training or retrieval purposes. Additionally, the COSER models were trained on only 90% of the books. The COSER Test set contains 200 samples: 100 from books used in COSER training (in-domain) and 100 from books not used in training (out-of-domain).
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU models, CPU types, memory) used to run its experiments or train its models, beyond naming the underlying models (LLaMA 3.1) and the LLM critics (GPT-4o).
Software Dependencies | Yes | In this paper, we employ Claude-3.5-Sonnet (20240620).
Experiment Setup | Yes | We fine-tune the LLaMA 3.1 models using the following hyperparameters: a learning rate of 1e-5, a sequence length of 16,384, training for 8 epochs, and a global batch size of 48.
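The fine-tuning setup quoted in the last row can be captured as a small configuration sketch. Only the values in the dictionary below are taken from the paper; the helper function and the example dataset size of 10,000 samples are hypothetical illustrations for estimating training length, not the authors' actual training code.

```python
# Hyperparameters quoted from the paper for fine-tuning LLaMA 3.1 into CoSER 8B/70B.
COSER_FINETUNE_CONFIG = {
    "base_models": ["LLaMA-3.1-8B", "LLaMA-3.1-70B"],
    "learning_rate": 1e-5,
    "max_sequence_length": 16_384,
    "num_epochs": 8,
    "global_batch_size": 48,
}

def steps_per_epoch(num_samples: int, global_batch_size: int) -> int:
    """Optimizer steps per epoch at a given global batch size (ceiling division)."""
    return -(-num_samples // global_batch_size)

# Hypothetical example: a training set of 10,000 conversation samples.
print(steps_per_epoch(10_000, COSER_FINETUNE_CONFIG["global_batch_size"]))  # prints 209
```

Note that the "global batch size of 48" refers to the effective batch across all devices; how it decomposes into per-device batch size and gradient accumulation is not specified in the paper.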