MChIRC: A Multimodal Benchmark for Chinese Idiom Reading Comprehension

Authors: Tongguan Wang, Mingmin Wu, Guixin Su, Dongyu Su, Yuxue Hu, Zhongqiang Huang, Ying Sha

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The authoritativeness of MChIRC and the effectiveness of DCIGN are demonstrated through a variety of experiments, which provides a new benchmark for the multimodal Chinese idiom reading comprehension task. Extensive experiments conducted on the MChIRC dataset demonstrate the effectiveness of our proposed method, achieving an average accuracy of 73% on the four test sets.
Researcher Affiliation Academia (1) Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, China; (2) Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, China; (3) Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China; (4) College of Informatics, Huazhong Agricultural University, Wuhan, China
Pseudocode No The paper describes the DCIGN model in detail with textual explanations and a diagram (Figure 3), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The paper states: "To address the above issues, we build, to the best of our knowledge, the first multimodal Chinese idiom reading comprehension dataset (MChIRC)," with a footnote pointing to https://github.com/Aichiniuroumian/MChIRC. This link is explicitly for the dataset, not the source code for the proposed DCIGN method.
Open Datasets Yes "To address the above issues, we build, to the best of our knowledge, the first multimodal Chinese idiom reading comprehension dataset (MChIRC). We crawl numerous images from Baidu and Sogou. After manual annotation, we collect 44,433 image-text pairs covering 2,926 idioms." Dataset link: https://github.com/Aichiniuroumian/MChIRC
Dataset Splits Yes The remaining samples are divided into training, validation, and test sets in a 7:1:2 ratio using stratified sampling to ensure as much balance as possible between interclass and intraclass data. To further test the robustness of the models, the authors construct the Sim set following Zheng, Huang, and Sun (2019); its candidate set consists of the six idioms semantically most similar to the correct idiom. The final divided dataset is shown in Table 2.
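The per-class 7:1:2 stratified split described above can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing code; `stratified_split` and `label_fn` are hypothetical names, and the paper does not specify the shuffling seed or tooling used.

```python
import random
from collections import defaultdict

def stratified_split(samples, label_fn, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split samples into train/val/test per class (hypothetical helper).

    Each class (here, each idiom) is shuffled and cut 7:1:2 independently,
    so class proportions stay roughly equal across the three splits.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[label_fn(s)].append(s)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

With 10 samples per idiom, each idiom contributes 7 training, 1 validation, and 2 test examples.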
Hardware Specification No The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies No The paper mentions using "a pre-trained BERT to extract textual features and a pre-trained DeiT to extract image features," providing HuggingFace links to these models in footnotes. However, it does not specify the versions of other ancillary software or libraries (e.g., Python, PyTorch/TensorFlow, CUDA) used to implement the methodology.
Experiment Setup Yes We perform a hyperparameter analysis of the weights α and β applied to Lccl and Lfcl, and the experimental results are shown in Table 6. We find that the average accuracy of the model is optimized when α=0.4, β=0.6.