SAIL: Sample-Centric In-Context Learning for Document Information Extraction

Authors: Jinyu Zhang, Zhiyuan You, Jize Wang, Xinyi Le

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. Evidence: "4 Experiments; 4.1 Datasets, Metrics, and Details; 4.2 Results on DIE Benchmarks; 4.3 Comparison with Multi-modal LLMs; 4.4 Ablation Studies"
Researcher Affiliation: Academia. Evidence: "Jinyu Zhang¹*, Zhiyuan You²*, Jize Wang¹, Xinyi Le¹; ¹Shanghai Jiao Tong University; ²The Chinese University of Hong Kong; EMAIL, EMAIL"
Pseudocode: No. The paper includes illustrations of the framework (Figure 2) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. Code available at https://github.com/sky-goldfish/SAIL
Open Datasets: Yes. Evidence: "FUNSD (Jaume, Ekenel, and Thiran 2019) is a dataset for understanding the content of tables in scanned documents. ... SROIE (Huang et al. 2019) is another scanned receipt understanding dataset... CORD (Park et al. 2019) is a receipt understanding dataset..."
Dataset Splits: Yes. Evidence: "FUNSD (Jaume, Ekenel, and Thiran 2019) is a dataset for understanding the content of tables in scanned documents. It contains 149 tables and 7,411 entities in the training set, and 50 tables and 2,332 entities in the test set. ... SROIE (Huang et al. 2019) is another scanned receipt understanding dataset, containing 626 receipts in the training set and 347 in the test set. ... CORD (Park et al. 2019) is a receipt understanding dataset that contains 800 training data, 100 test data, and 100 validation data."
Hardware Specification: No. The paper names specific LLM APIs (GPT-3.5, GPT-4o) and a specific open-source model version (chatglm3-6b-32k), but gives no details on the hardware used to run the experiments or host the models.
Software Dependencies: No. The paper mentions ChatGLM3 (chatglm3-6b-32k), GPT-3.5 (gpt-3.5-turbo API), GPT-4o (gpt-4o API), and Sentence-BERT, but provides no version numbers for ancillary dependencies such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup: Yes. Evidence: "For GPT-3.5 and GPT-4o, we set the temperature parameter to 0 to enhance the reproducibility. In our experiments, for each test document, we select four textually similar documents and four layout-similar documents as examples due to the limitation of prompt token number. Furthermore, for each filtered test entity, we choose four textually similar entity examples."
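The example-selection step quoted above (picking the four most similar training documents for each test document) can be sketched as a top-k cosine-similarity lookup over embedding vectors. This is an illustrative sketch only: the random toy vectors stand in for real Sentence-BERT document embeddings, and the function name `top_k_similar` is not from the paper.

```python
import numpy as np

def top_k_similar(query_vec, example_vecs, k=4):
    """Return indices of the k examples most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity to each example
    return np.argsort(-sims)[:k].tolist()

# Toy embeddings standing in for Sentence-BERT document vectors.
rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(10, 8))   # 10 training documents
# A test document nearly identical to training document 3.
test_embedding = train_embeddings[3] + 0.01 * rng.normal(size=8)

picked = top_k_similar(test_embedding, train_embeddings, k=4)
print(picked)  # training document 3 ranks first
```

The same routine would be run twice per test document under the paper's setup, once over text embeddings and once over layout representations, to gather the four textually similar and four layout-similar examples.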