Multilingual LLMs Inherently Reward In-Language Time-Sensitive Semantic Alignment for Low-Resource Languages

Authors: Ashutosh Bajpai, Tanmoy Chakraborty

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical evidence underscores the superior performance of CLiTSSA compared to established baselines across three languages (Romanian, German, and French), encompassing three temporal tasks and a diverse set of four contemporaneous LLMs. This marks a significant step forward in addressing resource disparity in the context of temporal reasoning across languages.
Researcher Affiliation Collaboration Ashutosh Bajpai (1,2), Tanmoy Chakraborty (1); 1 Indian Institute of Technology Delhi, India; 2 Wipro Research, India
Pseudocode No The paper describes methods and procedures in narrative text, without presenting any structured pseudocode or algorithm blocks.
Open Source Code Yes Source code and dataset are available at https://github.com/abiitd/clitssa.
Open Datasets Yes Source code and dataset are available at https://github.com/abiitd/clitssa.
Dataset Splits Yes Table 2: Dataset statistics for mTEMPREASON.
            Train      Dev        Test
Time Range  1014-2022  634-2023   998-2023
L1          400,000    4,000      4,000
L2          16,017     5,521      5,397
L3          13,014     4,437      4,426
Hardware Specification No The paper mentions various LLMs used (LLaMA3-8B, Mistral-v1, Vicuna-7b-v1.5, Bloomz-7b1) but does not provide any specific details about the hardware (GPUs, CPUs, memory, etc.) on which these models were run or fine-tuned.
Software Dependencies No The paper mentions using the T5 model, multilingual Sentence-BERT, and distiluse-base-multilingual-cased-v1 as foundational models, but does not specify their version numbers or other software dependencies with versions.
Experiment Setup Yes A three-shot ICL approach is used throughout the experimental setting, demonstrating superior outcomes compared to both one-shot and two-shot configurations. The value of h and w is set empirically at 30 and 10, respectively. To fine-tune the CLiTSSA retriever model, distiluse-base-multilingual-cased-v1 serves as the foundational model. This method is systematically applied to each low-resource language across temporal tasks L1, L2, and L3 to ensure optimum performance. Additionally, an integrated CLiTSSA retriever is fine-tuned across languages and temporal tasks. The Train and Dev datasets from mTEMPREASON are used to construct the parallel corpus to fine-tune the CLiTSSA retriever, with a separate held-out test set employed to benchmark all outcomes. We use word-level F1 scores and exact match (EM) standards to quantify the LLMs' responses. Please refer to the technical appendix for ablations on few-shots, parameters h & w, along with hyperparameters in detail.
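The setup quote above evaluates LLM responses with word-level F1 and exact match (EM). The paper does not print its scoring code, so the following is a minimal sketch of the standard SQuAD-style definitions of these two metrics (whitespace tokenization and lowercase normalization are assumptions, not details from the paper):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> int:
    """1 if the normalized prediction equals the gold answer exactly, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def word_f1(pred: str, gold: str) -> float:
    """Word-level F1: harmonic mean of token precision and recall,
    where token overlap is counted with multiplicity."""
    pred_toks = pred.strip().lower().split()
    gold_toks = gold.strip().lower().split()
    if not pred_toks or not gold_toks:
        # Both empty -> perfect match; one empty -> no overlap.
        return float(pred_toks == gold_toks)
    common = Counter(pred_toks) & Counter(gold_toks)
    n_common = sum(common.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_toks)
    recall = n_common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, `word_f1("angela merkel", "merkel")` gives precision 1/2 and recall 1, so F1 is 2/3 while EM is 0; corpus-level scores would be averages of these per-example values.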