ChronoFact: Timeline-based Temporal Fact Verification

Authors: Anab Maulana Barik, Wynne Hsu, Mong Li Lee

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Experimental results demonstrate the effectiveness of our approach in handling the intricacies of temporal claim verification. We also introduce a new dataset of complex temporal claims involving timeline-based reasoning for the training and evaluation of our proposed framework. Sections 5, 5.1, 5.2, and 5.3 further detail performance studies, sensitivity experiments, comparative experiments, and ablation studies, respectively, all indicating empirical validation.
Researcher Affiliation: Academia. Anab Maulana Barik (1), Wynne Hsu (1,2), and Mong Li Lee (1,3); (1) School of Computing, National University of Singapore, Singapore; (2) Institute of Data Science, National University of Singapore, Singapore; (3) Centre for Trusted Internet and Community, National University of Singapore, Singapore. EMAIL, EMAIL
Pseudocode: No. The paper describes the framework components and their processes in detail using natural language and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code: No. The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository. A footnote references an arXiv paper for micro F1 scores, but this is not a code repository.
Open Datasets: Yes. We introduce a new benchmark dataset called ChronoClaims that is designed for enhancing the accuracy and complexity of timeline-based fact verification. Besides the ChronoClaims dataset, we also use the T-FEVER and T-FEVEROUS datasets [Barik et al., 2024]. These datasets are derived from the benchmark fact verification datasets FEVER [Thorne et al., 2018a], FEVER2.0 [Thorne et al., 2018b], and FEVEROUS [Aly et al., 2021], respectively... We also evaluate our method on T-QuanTemp, a subset of the QuanTemp [Venktesh et al., 2024] dataset...
Dataset Splits: Yes. In total, we generated 40,249, 3,544, and 3,735 claims for the training, validation, and test sets (for ChronoClaims). We use 80% of the data for training and 20% for testing (for T-FEVER, T-FEVEROUS, and T-QuanTemp).
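As a quick sanity check on the quoted ChronoClaims counts (a sketch only; the 80/20 split applies to the other datasets, not to these numbers):

```python
# Reported ChronoClaims split sizes (train / validation / test).
train, val, test = 40_249, 3_544, 3_735
total = train + val + test

print(total)                    # 47528 claims in total
print(round(train / total, 3))  # train fraction: 0.847
```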
Hardware Specification: No. The paper mentions implementing the framework using the Hugging Face Transformers library with PyTorch, but it does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments.
Software Dependencies: No. The paper mentions the 'Hugging Face Transformers Library with PyTorch' and 'flan-T5 base' but does not specify version numbers for these software components, which are necessary for a reproducible dependency description.
Experiment Setup: Yes. The Event Encoder uses flan-T5 base [Chung et al., 2024] with a hidden size of 768. In the Multi-level Attention Encoder, the token-level, event-level, and time-level representations pass through a linear layer of dimension 768 to calculate attention scores. The hidden size of the fully connected layers is set to 192. The Chronological Order Classifier uses two layers of Bi-LSTM, each with a hidden size of 768, and the fully connected layers have a hidden size of 192, matching those in the Claim Classifier. We train the model using Adafactor with a batch size of 8 and a learning rate of 5e-5 for 5 epochs on each dataset. The best performance is achieved when µ = 0.3 across all datasets, and we use this value for the rest of the experiments. The optimal performance for T-FEVER, T-FEVEROUS, and ChronoClaims was achieved when k is 3, 5, and 7, respectively, and we use these values in our experiments.
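Collected in one place, the quoted hyperparameters amount to the following configuration fragment. This is a sketch: the key names and schema are hypothetical (the paper defines no config format), and only the values come from the quoted text.

```python
# Hypothetical config schema; values are the hyperparameters reported in the paper.
chronofact_config = {
    "event_encoder": "flan-T5 base",  # [Chung et al., 2024], hidden size 768
    "hidden_size": 768,
    "attention_linear_dim": 768,      # token-/event-/time-level attention scores
    "fc_hidden_size": 192,            # fully connected layers, Claim Classifier
    "chrono_order_classifier": {
        "bilstm_layers": 2,
        "bilstm_hidden": 768,
        "fc_hidden": 192,             # matches the Claim Classifier
    },
    "optimizer": "Adafactor",
    "batch_size": 8,
    "learning_rate": 5e-5,
    "epochs": 5,
    "mu": 0.3,                        # best across all datasets
    "k": {"T-FEVER": 3, "T-FEVEROUS": 5, "ChronoClaims": 7},
}
```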