A General Approach to Multimodal Document Quality Assessment

Authors: Aili Shen, Bahar Salehi, Jianzhong Qi, Timothy Baldwin

JAIR 2020

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our joint model achieves state-of-the-art results over five datasets in two domains (Wikipedia and academic papers), which demonstrates the complementarity of textual and visual features, and the general applicability of our model. To examine what kinds of features our model has learned, we further train our model in a multi-task learning setting, where document quality assessment is the primary task and feature learning is an auxiliary task. Experimental results show that visual embeddings are better at learning structural features while textual embeddings are better at learning readability scores, which further verifies the complementarity of visual and textual features.
Researcher Affiliation | Academia | Aili Shen EMAIL Bahar Salehi EMAIL Jianzhong Qi EMAIL Timothy Baldwin EMAIL School of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
Pseudocode | No | The paper describes methods like the Inception V3 model and bi-directional LSTM, and outlines their combination in Figure 2, but does not present them in the form of structured pseudocode or algorithm blocks. Descriptions are narrative and supported by diagrams.
Open Source Code | Yes | All code and data are available at https://github.com/AiliAili/MultiModal.
Open Datasets | Yes | All code and data are available at https://github.com/AiliAili/MultiModal.
Dataset Splits | Yes | We additionally randomly partitioned this dataset into training, development, and test splits based on a ratio of 8:1:1. Details of the dataset are presented in Table 1.
Hardware Specification | No | The paper describes software dependencies and experimental settings but does not specify any particular hardware used for running the experiments, such as CPU or GPU models.
Software Dependencies | No | The paper mentions software tools like NLTK, ImageMagick, and PyPDF2, but does not provide specific version numbers for these or any other software dependencies crucial for replication.
Experiment Setup | Yes | We set the LSTM hidden layer size to 256. A dropout layer with a probability of 0.5 is applied at both the sentence and document levels. For the bi-LSTM, we use a mini-batch size of 128 and a learning rate of 0.001. For both Inception and the joint model, we use a mini-batch size of 16 and a learning rate of 0.0001. All hyper-parameters were set empirically over the development data, and the models are optimized using the Adam optimizer (Kingma & Ba, 2015). We train each model for 50 epochs; however, to prevent overfitting, we adopt early stopping, where we stop training if the performance on the development set does not improve for 20 epochs. For Inception, we adopt data augmentation during training with a nearest filling mode, a zoom range of 0.1, a width shift range of 0.1, and a height shift range of 0.1.
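The early-stopping rule quoted in the Experiment Setup row (train for up to 50 epochs, halt once development-set performance fails to improve for 20 consecutive epochs) can be sketched in plain Python. This is a minimal illustration of the stated rule only; `train_one_epoch` and `evaluate_dev` are hypothetical stand-ins, not routines from the authors' released code.

```python
MAX_EPOCHS = 50   # total training budget reported in the paper
PATIENCE = 20     # epochs without dev-set improvement before stopping

def train_with_early_stopping(train_one_epoch, evaluate_dev):
    """Run up to MAX_EPOCHS epochs, stopping early when the development-set
    score has not improved for PATIENCE consecutive epochs.

    Returns (best development score, number of epochs actually run).
    """
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch in range(MAX_EPOCHS):
        train_one_epoch(epoch)
        score = evaluate_dev()
        if score > best_score:
            best_score = score              # new best on the development set
            epochs_without_improvement = 0  # reset the patience counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= PATIENCE:
                break                       # no improvement for PATIENCE epochs
    return best_score, epoch + 1
```

For example, if the development score improves for two epochs and then plateaus, training stops 20 epochs after the last improvement rather than running the full 50.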