Test-Time Adaptation for Visual Document Understanding

Authors: Sayna Ebrahimi, Sercan Ö. Arık, Tomas Pfister

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DocTTA shows significant improvements on these compared to the source model performance, up to 1.89% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively. Our benchmark datasets are available at https://saynaebrahimi.github.io/DocTTA.html. ... In this paper, we propose DocTTA, a novel TTA method for VDU that utilizes self-supervised learning on text and layout modalities using masked visual language modeling (MVLM) while jointly optimizing with pseudo labeling. We introduce an uncertainty-aware per-batch pseudo-labeling selection mechanism, which makes more accurate predictions compared to the commonly-used pseudo-labeling techniques in CV that use no pseudo-labeling selection mechanism (Liang et al., 2020) in TTA or select pseudo labels based on both uncertainty and confidence (Rizve et al., 2021) in semi-supervised learning settings. ... We show DocTTA significantly improves source model performance at test time on all VDU tasks without any supervision. ... For the entity recognition and key-value extraction tasks, we use the entity-level F1 score as the evaluation metric, whereas for the document VQA task, we use Average Normalized Levenshtein Similarity (ANLS), introduced by Biten et al. (2019). ... Ablation studies: We compare the impact of the different constituents of our method on the DocVQA-TTA benchmark, using a model trained on the Emails&Letters domain and adapted to the other three domains. Table 3 shows that the pseudo-labeling selection mechanism plays an important role and that using confidence scores to accept pseudo labels results in the poorest performance, much below the source-only ANLS values and even worse than not using pseudo labeling.
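The ANLS metric cited in the row above is precisely defined in Biten et al. (2019): each prediction is scored as 1 − NL(pred, ref) against its closest reference answer, where NL is the Levenshtein distance normalized by the longer string's length; scores with NL above a threshold τ (0.5 in the original paper) are zeroed, and the per-question maxima are averaged. A minimal self-contained sketch (function names are our own; we also lowercase and strip answers before comparison, a common preprocessing assumption rather than something stated in this report):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity (Biten et al., 2019).

    predictions: list of predicted answer strings, one per question.
    ground_truths: list of lists of reference answers, one list per question.
    """
    scores = []
    for pred, refs in zip(predictions, ground_truths):
        per_ref = []
        for ref in refs:
            p, r = pred.lower().strip(), ref.lower().strip()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            # Zero out scores whose normalized distance exceeds tau.
            per_ref.append(1.0 - nl if nl <= tau else 0.0)
        scores.append(max(per_ref))
    return sum(scores) / len(scores)
```

For example, an exact match scores 1.0, a one-character slip against a five-character answer scores 0.8, and an answer with no overlap at all falls past τ and scores 0.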
Researcher Affiliation | Industry | Sayna Ebrahimi EMAIL, Google Cloud AI Research; Sercan Ö. Arık EMAIL, Google Cloud AI Research; Tomas Pfister tpfister@google.com, Google Cloud AI Research
Pseudocode | Yes | Algorithm 1 DocTTA for closed-set TTA in VDU ... Algorithm 2 DocUDA for closed-set UDA in VDU
Open Source Code | No | To ensure full reproducibility, we will release our code upon acceptance.
Open Datasets | Yes | Our benchmark datasets are available at https://saynaebrahimi.github.io/DocTTA.html. ... We use three publicly-available datasets to construct our benchmarks. These datasets can be downloaded from their original hosts under their terms and conditions: FUNSD (Jaume et al., 2019): license, download instructions, and terms of use can be found at https://guillaumejaume.github.io/FUNSD/work/ SROIE (Huang et al., 2019): license, download instructions, and terms of use can be found at https://github.com/zzzDavid/ICDAR-2019-SROIE DocVQA (Mathew et al., 2021): license, download instructions, and terms of use can be found at https://www.docvqa.org/datasets/doccvqa
Dataset Splits | Yes | To better highlight the impact of distribution shifts and to study methods that are robust against them, we introduce new benchmarks for VDU. Our benchmark datasets are constructed from existing popular and publicly-available VDU data to mimic real-world challenges. We have attached the training and test splits for all our benchmark datasets in the supplementary materials. ... As a representative distribution-shift challenge on FUNSD, we split the source and target documents based on a measure of the sparsity of available information. The original dataset has 9,707 semantic entities and 31,485 words with 4 categories of entities: question, answer, header, and other, where each category (except other) is either the beginning or an intermediate word of a sentence. Therefore, in total, we have 7 classes. We first combine the original training and test splits and then manually divide them into two groups. We set aside 149 forms that are densely filled with text for the source domain and put 50 forms that are sparsely filled in the target domain. We randomly choose 10 out of the 149 documents for validation and the remaining 139 for training. ... Table 4: Number of documents in the source and target domains in the FUNSD-TTA and SROIE-TTA benchmarks. ... Table 5 (FUNSD-TTA): Source Training: 139; Source Validation: 10; Source Evaluation, Target Training, Target Evaluation: 50. ... Table 6 (SROIE-TTA): Source Training: 600; Source Validation: 39; Source Evaluation, Target Training, Target Evaluation: 347. ... Table 7: Number of documents in each domain of our DocVQA-TTA benchmark. Domains: Layout (L), Emails&Letters (E), Tables&Lists (T), Figures&Diagrams (F). Source Training: 1807 / 1417 / 592 / 150; Source Validation: 200 / 157 / 65 / 17; Source Evaluation, Target Training, Target Evaluation: 512 / 137 / 187 / 49.
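The split arithmetic quoted above is internally consistent and easy to check mechanically. A small sanity-check sketch using only the counts from the quoted text and Tables 5–7 (the dictionary layout is ours, not the paper's release format):

```python
# Document counts as quoted from the paper's split description and tables.
funsd = {"source_train": 139, "source_val": 10, "target": 50}
sroie = {"source_train": 600, "source_val": 39, "target": 347}

# FUNSD: 149 densely filled forms (source) split into 139 train + 10 val,
# plus 50 sparsely filled forms held out as the target domain.
assert funsd["source_train"] + funsd["source_val"] == 149
assert funsd["target"] == 50

# DocVQA-TTA per-domain counts from Table 7:
# Layout (L), Emails&Letters (E), Tables&Lists (T), Figures&Diagrams (F).
docvqa = {
    "L": {"train": 1807, "val": 200, "eval": 512},
    "E": {"train": 1417, "val": 157, "eval": 137},
    "T": {"train": 592,  "val": 65,  "eval": 187},
    "F": {"train": 150,  "val": 17,  "eval": 49},
}
total_train = sum(d["train"] for d in docvqa.values())
total_val   = sum(d["val"]   for d in docvqa.values())
total_eval  = sum(d["eval"]  for d in docvqa.values())
```

Summing Table 7 gives 3966 source-training, 439 validation, and 885 evaluation/target documents across the four DocVQA-TTA domains.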
Hardware Specification | Yes | We use PyTorch (Paszke et al., 2019) on Nvidia Tesla V100 GPUs for all the experiments.
Software Dependencies | No | We use PyTorch (Paszke et al., 2019) on Nvidia Tesla V100 GPUs for all the experiments. For source training, we use LayoutLMv2-BASE pre-trained on the IIT-CDIP dataset and fine-tune it with labeled source data on our desired task.
Experiment Setup | Yes | Details on training and hyperparameter tuning are provided in the Appendix. ... For all VDU tasks, we build task-specific classifier head layers over the text embedding of the LayoutLMv2-BASE outputs. For the entity recognition and key-value extraction tasks, we use the standard cross-entropy loss, and for the DocVQA task, we use the binary cross-entropy loss on each token to predict whether or not it is the starting/ending position of the answer. We use the AdamW (Loshchilov & Hutter, 2017) optimizer and train the source model with batch sizes of 32, 32, and 64 for 200, 200, and 70 epochs with a learning rate of 5×10⁻⁵ for the entity recognition, key-value extraction, and DocVQA benchmarks, respectively, with the exception of the Figures&Diagrams domain, on which we used a learning rate of 10⁻⁵. ... We used a simple grid search to find the optimal set of hyperparameters with the following search space: Learning rate {10⁻⁵, 2.5×10⁻⁵, 5×10⁻⁵}; Weight decay {0, 0.01}; Batch size {1, 4, 5, 8, 32, 40, 48, 64}; Uncertainty threshold γ {1.5, 2}
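The grid search described above can be enumerated directly from the stated search space. A hypothetical sketch (the paper does not specify the enumeration order or the selection criterion beyond validation performance, so `grid` and the key names are our own):

```python
from itertools import product

# Search space exactly as quoted in the experiment-setup row.
search_space = {
    "learning_rate": [1e-5, 2.5e-5, 5e-5],
    "weight_decay": [0.0, 0.01],
    "batch_size": [1, 4, 5, 8, 32, 40, 48, 64],
    "uncertainty_threshold": [1.5, 2.0],  # gamma for pseudo-label selection
}

def grid(space):
    """Yield every hyperparameter configuration in the Cartesian product."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid(search_space))
# 3 learning rates x 2 weight decays x 8 batch sizes x 2 thresholds = 96 runs.
```

In practice each configuration would be scored on the source validation split and the best-scoring one kept; at 96 candidates, exhaustive enumeration is cheap relative to fine-tuning cost.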