CharacterBench: Benchmarking Character Customization of Large Language Models

Authors: Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments conducted with our developed Character Judge show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark's potential to optimize LLMs' character customization.
Researcher Affiliation Collaboration 1The CoAI Group, DCST, Tsinghua University; 2Lingxin AI; 3Fuxi AI Lab, NetEase
Pseudocode No The paper describes methods and processes in narrative text and mathematical formulas, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not explicitly state that source code for its methodology is released, nor does it provide a link to a code repository.
Open Datasets No The paper introduces CHARACTERBENCH as a new benchmark and describes its creation and statistics, stating it includes '22,859 human-annotated samples', but it does not provide any direct link, DOI, or repository for public access to this dataset.
Dataset Splits Yes We split the data into training and test sets to develop our Character Judge model for evaluating LLMs' character customization. The test set is further divided into In-domain and Out-of-domain sets, each domain containing 125 samples from each dimension. Table 2 provides the exact sample counts: Training Set (19,609), Test Set (3,250), Test Set (In-domain) (1,625), Test Set (Out-of-domain) (1,625).
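The counts quoted above are internally consistent with the paper's stated total of 22,859 human-annotated samples; a minimal sanity check (the variable names and the inferred dimension count are ours, derived only from the figures quoted above, not taken from the paper):

```python
# Split sizes as reported in Table 2 of the paper.
splits = {
    "train": 19_609,
    "test_in_domain": 1_625,
    "test_out_of_domain": 1_625,
}

# The two test partitions should sum to the reported test-set size.
test_total = splits["test_in_domain"] + splits["test_out_of_domain"]
print(test_total)  # 3250, matching the reported Test Set size

# Train + test should recover the benchmark's total sample count.
total = splits["train"] + test_total
print(total)  # 22859, matching the 22,859 human-annotated samples

# With 125 samples per dimension per domain, each 1,625-sample test
# partition implies 13 evaluation dimensions (our inference).
dimensions = splits["test_in_domain"] // 125
print(dimensions)  # 13
```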
Hardware Specification No The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory) used for running its experiments or training the Character Judge model.
Software Dependencies No The paper mentions using specific LLM models (e.g., 'GPT-4o' for translation, 'Qwen2-7B-Chat' for Character Judge), but it does not specify the version numbers of ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other tools used in their experimental setup.
Experiment Setup No The paper does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings for training the Character Judge model or other components of their methodology.