ChronoFact: Timeline-based Temporal Fact Verification

Authors: Anab Maulana Barik, Wynne Hsu, Mong Li Lee

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Experimental results demonstrate the effectiveness of our approach in handling the intricacies of temporal claim verification. We also introduce a new dataset of complex temporal claims involving timeline-based reasoning for the training and evaluation of our proposed framework. Sections 5, 5.1, 5.2, and 5.3 further detail performance studies, sensitivity experiments, comparative experiments, and ablation studies, respectively, all indicating empirical validation.
Researcher Affiliation: Academia. Anab Maulana Barik (1), Wynne Hsu (1,2), and Mong Li Lee (1,3); (1) School of Computing, National University of Singapore, Singapore; (2) Institute of Data Science, National University of Singapore, Singapore; (3) Centre for Trusted Internet and Community, National University of Singapore, Singapore. EMAIL, EMAIL
Pseudocode: No. The paper describes the framework components and their processes in detail using natural language and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code: No. The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository. A footnote references an arXiv paper for micro F1 scores, but this is not a code repository.
Open Datasets: Yes. We introduce a new benchmark dataset called ChronoClaims that is designed for enhancing the accuracy and complexity of timeline-based fact verification. Besides the ChronoClaims dataset, we also use the T-FEVER and T-FEVEROUS datasets [Barik et al., 2024]. These datasets are derived from the benchmark fact verification datasets FEVER [Thorne et al., 2018a], FEVER2.0 [Thorne et al., 2018b], and FEVEROUS [Aly et al., 2021], respectively... We also evaluate our method on T-QuanTemp, a subset of the QuanTemp [Venktesh et al., 2024] dataset...
Dataset Splits: Yes. In total, we generated 40,249, 3,544, and 3,735 claims for the training, validation, and test sets (for ChronoClaims). We use 80% of the data for training and 20% for testing (for T-FEVER, T-FEVEROUS, and T-QuanTemp).
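As a quick sanity check on the quoted ChronoClaims counts (a sketch only; the 80/20 split applies to the other datasets, not to these numbers):

```python
# Reported ChronoClaims split sizes (train / validation / test).
train, val, test = 40_249, 3_544, 3_735
total = train + val + test

print(total)                    # 47528 claims in total
print(round(train / total, 3))  # train fraction: 0.847
```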
Hardware Specification: No. The paper mentions implementing the framework using the Hugging Face Transformers library with PyTorch, but it does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments.
Software Dependencies: No. The paper mentions the 'Hugging Face Transformers Library with PyTorch' and 'flan-T5 base' but does not specify version numbers for these software components, which are necessary for a reproducible dependency description.
Experiment Setup: Yes. The Event Encoder uses flan-T5 base [Chung et al., 2024] with a hidden size of 768. In the Multi-level Attention Encoder, the token-level, event-level, and time-level representations pass through a linear layer of dimension 768 to calculate attention scores. The hidden size of the fully connected layers is set to 192. The Chronological Order Classifier uses two layers of Bi-LSTM, each with a hidden size of 768, and the fully connected layers have a hidden size of 192, matching those in the Claim Classifier. We train the model using Adafactor with a batch size of 8 and a learning rate of 5e-5 for 5 epochs on each dataset. The best performance is achieved when µ = 0.3 across all datasets, and we use this value for the rest of the experiments. The optimal performance for T-FEVER, T-FEVEROUS, and ChronoClaims was achieved when k is 3, 5, and 7, respectively, and we use these values in our experiments.
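Collected in one place, the quoted hyperparameters amount to the following configuration fragment. This is a sketch: the key names and schema are hypothetical (the paper defines no config format), and only the values come from the quoted text.

```python
# Hypothetical config schema; values are the hyperparameters reported in the paper.
chronofact_config = {
    "event_encoder": "flan-T5 base",  # [Chung et al., 2024], hidden size 768
    "hidden_size": 768,
    "attention_linear_dim": 768,      # token-/event-/time-level attention scores
    "fc_hidden_size": 192,            # fully connected layers, Claim Classifier
    "chrono_order_classifier": {
        "bilstm_layers": 2,
        "bilstm_hidden": 768,
        "fc_hidden": 192,             # matches the Claim Classifier
    },
    "optimizer": "Adafactor",
    "batch_size": 8,
    "learning_rate": 5e-5,
    "epochs": 5,
    "mu": 0.3,                        # best across all datasets
    "k": {"T-FEVER": 3, "T-FEVEROUS": 5, "ChronoClaims": 7},
}
```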