MChIRC: A Multimodal Benchmark for Chinese Idiom Reading Comprehension

Authors: Tongguan Wang, Mingmin Wu, Guixin Su, Dongyu Su, Yuxue Hu, Zhongqiang Huang, Ying Sha

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The authoritativeness of MChIRC and the effectiveness of DCIGN are demonstrated through a variety of experiments, which provides a new benchmark for the multimodal Chinese idiom reading comprehension task. Extensive experiments conducted on the MChIRC dataset demonstrate the effectiveness of our proposed method, achieving an average accuracy of 73% on the four test sets.
Researcher Affiliation Academia (1) Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; (2) Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; (3) Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; (4) College of Informatics, Huazhong Agricultural University, Wuhan, China
Pseudocode No The paper describes the DCIGN model in detail with textual explanations and a diagram (Figure 3), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The paper states: "To address the above issues, we build, to the best of our knowledge, the first multimodal Chinese idiom reading comprehension dataset (MChIRC)," with a footnote pointing to https://github.com/Aichiniuroumian/MChIRC. This link is explicitly for the dataset, not the source code for the proposed DCIGN method.
Open Datasets Yes "To address the above issues, we build, to the best of our knowledge, the first multimodal Chinese idiom reading comprehension dataset (MChIRC). We crawl numerous images from Baidu and Sogou. After manual annotation, we collect 44,433 image-text pairs covering 2,926 idioms." Dataset link: https://github.com/Aichiniuroumian/MChIRC
Dataset Splits Yes The remaining samples are divided into training, validation, and test sets in a 7:1:2 ratio using stratified sampling to ensure as much balance as possible between interclass and intraclass data. To further test the robustness of the models, the authors construct the Sim set following Zheng, Huang, and Sun (2019); its candidate set consists of the six idioms semantically most similar to the correct idiom. The final divided dataset is shown in Table 2.
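The per-class 7:1:2 stratified split described above can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing code; `stratified_split` and `label_fn` are hypothetical names, and the paper does not specify the shuffling seed or tooling used.

```python
import random
from collections import defaultdict

def stratified_split(samples, label_fn, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split samples into train/val/test per class (hypothetical helper).

    Each class (here, each idiom) is shuffled and cut 7:1:2 independently,
    so class proportions stay roughly equal across the three splits.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[label_fn(s)].append(s)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

With 10 samples per idiom, each idiom contributes 7 training, 1 validation, and 2 test examples.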
Hardware Specification No The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies No The paper mentions using "a pre-trained BERT to extract textual features and a pre-trained DeiT to extract image features," providing HuggingFace links to these models in footnotes. However, it does not specify the versions of other ancillary software or libraries (e.g., Python, PyTorch/TensorFlow, CUDA) used to implement the methodology.
Experiment Setup Yes We perform a hyperparameter analysis of the weights α and β applied to Lccl and Lfcl, and the experimental results are shown in Table 6. We find that the average accuracy of the model is optimized when α=0.4, β=0.6.