WenyanGPT: A Large Language Model for Classical Chinese Tasks

Authors: Xinyu Yao, Mengdi Wang, Bo Chen, Xiaobing Zhao

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on WenyanBench demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks."
Researcher Affiliation | Academia | ¹School of Information Engineering, Minzu University of China; ²National Language Resource Monitoring and Research Center of Minority Languages.
Pseudocode | No | The paper describes its methods and processes in paragraph text and through diagrams (Figure 2: Overall Training Framework of WenyanGPT; Figure 3: Instruction Fine-Tuning Data Construction Process), but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The abstract states: "We make the model's training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available..." but does not explicitly state that the source code for WenyanGPT itself is publicly available. A GitHub link is provided in a footnote for the baseline model Xunzi, not for WenyanGPT.
Open Datasets | Yes | "We make the model's training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing."
Dataset Splits | Yes | "We make the model's training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing. In order to evaluate the model's performance on Classical Chinese tasks, we devise a benchmark known as WenyanBench."
Hardware Specification | No | The paper mentions using the LLaMA3-8B-Chinese model and training with the bfloat16 data format, but does not provide specific details about the hardware (e.g., GPU models, CPU types) used for the experiments.
Software Dependencies | No | The paper mentions using LLaMA3-8B-Chinese as the base model but does not specify any software dependencies with version numbers (e.g., specific Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | Hyper-parameter settings are reported in Table 2 (pre-training) and Table 4 (fine-tuning).
Pre-training (Table 2): per_device_train_batch_size 16; gradient_accumulation_steps 1; learning_rate 1.0e-4; num_train_epochs 1; lr_scheduler_type cosine; warmup_ratio 0.1.
Fine-tuning (Table 4): per_device_train_batch_size 8; gradient_accumulation_steps 2; learning_rate 1.0e-4; num_train_epochs 1; lr_scheduler_type cosine; warmup_ratio 0.1.
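Since the paper reports these hyper-parameters but does not release training code, the following sketch simply restates the reported values as plain Python dicts (the key names follow common trainer conventions and are an assumption, not the paper's actual configuration format):

```python
# Hyper-parameters as reported in Table 2 (pre-training) of the paper.
pretrain_hparams = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1.0e-4,
    "num_train_epochs": 1,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
}

# Table 4 (fine-tuning) halves the per-device batch size and doubles
# gradient accumulation; all other settings are unchanged.
finetune_hparams = {
    **pretrain_hparams,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
}

def effective_batch_size(h):
    """Per-device effective batch = batch size x accumulation steps."""
    return h["per_device_train_batch_size"] * h["gradient_accumulation_steps"]
```

Note that both stages end up with the same per-device effective batch size (16 × 1 = 8 × 2 = 16), so the fine-tuning settings trade memory for extra accumulation steps without changing the effective batch.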