CharacterBench: Benchmarking Character Customization of Large Language Models

Authors: Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments conducted with our developed Character Judge show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark's potential to optimize LLMs' character customization.
Researcher Affiliation Collaboration 1The CoAI Group, DCST, Tsinghua University; 2Lingxin AI; 3Fuxi AI Lab, NetEase
Pseudocode No The paper describes methods and processes in narrative text and mathematical formulas, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not explicitly state that source code for its methodology is released, nor does it provide a link to a code repository.
Open Datasets No The paper introduces CHARACTERBENCH as a new benchmark and describes its creation and statistics, stating it includes '22,859 human-annotated samples', but it does not provide any direct link, DOI, or repository for public access to this dataset.
Dataset Splits Yes We split the data into training and test sets to develop our Character Judge model for evaluating LLMs' character customization. The test set is further divided into In-domain and Out-of-domain sets, each domain containing 125 samples from each dimension. Table 2 provides the exact sample counts: Training Set (19,609), Test Set (3,250), Test Set (In-domain) (1,625), Test Set (Out-of-domain) (1,625).
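The counts quoted above are internally consistent with the paper's stated total of 22,859 human-annotated samples; a minimal sanity check (the variable names and the inferred dimension count are ours, derived only from the figures quoted above, not taken from the paper):

```python
# Split sizes as reported in Table 2 of the paper.
splits = {
    "train": 19_609,
    "test_in_domain": 1_625,
    "test_out_of_domain": 1_625,
}

# The two test partitions should sum to the reported test-set size.
test_total = splits["test_in_domain"] + splits["test_out_of_domain"]
print(test_total)  # 3250, matching the reported Test Set size

# Train + test should recover the benchmark's total sample count.
total = splits["train"] + test_total
print(total)  # 22859, matching the 22,859 human-annotated samples

# With 125 samples per dimension per domain, each 1,625-sample test
# partition implies 13 evaluation dimensions (our inference).
dimensions = splits["test_in_domain"] // 125
print(dimensions)  # 13
```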
Hardware Specification No The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory) used for running its experiments or training the Character Judge model.
Software Dependencies No The paper mentions using specific LLM models (e.g., 'GPT-4o' for translation, 'Qwen2-7B-Chat' for Character Judge), but it does not specify the version numbers of ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other tools used in their experimental setup.
Experiment Setup No The paper does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings for training the Character Judge model or other components of their methodology.