Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval

Authors: Guangyuan Ma, Yongliang Ma, Xing Wu, Zhenpeng Su, Ming Zhou, Songlin Hu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show consistent improvements on large-scale retrieval benchmarks and up to a 30% reduction in dataset usage after applying the tDRO optimization algorithm across a series of different-sized LLM-DR models.
Researcher Affiliation | Collaboration | Guangyuan Ma (1,2,*), Yongliang Ma (3), Xing Wu (1,2), Zhenpeng Su (1,2), Ming Zhou (3), Songlin Hu (1,2). 1: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2: School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; 3: Langboat Technology, Beijing, China. EMAIL, EMAIL
Pseudocode | No | The paper describes the Task-level Distributionally Robust Optimization (tDRO) algorithm steps in narrative text under sections such as "Task-level Distributionally Robust Optimization", "InfoNCE Loss", "Optimization Objective", "Weights Update", "Relative Loss Measurement", and "Proxy Model Update", but does not present them in a structured pseudocode or algorithm block.
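Since the paper gives the algorithm only in narrative form, the weight-update step (relative loss measurement followed by an exponentiated-gradient update on task weights) can be sketched roughly as follows. This is a minimal illustration under assumptions: the function name, the ratio-style relative loss, and the mirror-ascent form of the update are not taken from the paper's code.

```python
import math

def tdro_weight_update(weights, proxy_losses, ref_losses, lr_alpha=2e-2):
    """Hedged sketch of a DRO-style task-weight update.

    weights      : current per-task sampling weights (sum to 1)
    proxy_losses : contrastive losses of the proxy model, one per task
    ref_losses   : losses of a frozen reference model, one per task
    lr_alpha     : weights learning rate (eta_alpha; 2e-2 in the paper)
    """
    # Relative loss measurement: how much worse the proxy model is
    # than the reference on each task (illustrative ratio form).
    rel = [p / max(r, 1e-12) for p, r in zip(proxy_losses, ref_losses)]
    # Exponentiated-gradient (mirror ascent) step: harder tasks,
    # i.e. those with larger relative loss, get larger weights.
    logits = [math.log(w) + lr_alpha * g for w, g in zip(weights, rel)]
    # Renormalize to a probability distribution over tasks.
    m = max(logits)
    exp = [math.exp(x - m) for x in logits]
    z = sum(exp)
    return [e / z for e in exp]

# Example: task 0 has the higher relative loss, so its weight grows.
new_w = tdro_weight_update([0.5, 0.5], [1.2, 0.8], [1.0, 1.0])
```

The renormalization keeps the weights on the simplex, so they can be reused directly as per-task sampling probabilities in the next data-selection round.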
Open Source Code | Yes | Code: https://github.com/tdro-llm/tdro
Open Datasets | Yes | Datasets: https://huggingface.co/tdro-llm
Dataset Splits | Yes | Benchmarks (# datasets): MIRACL (18), MKQA (25), BeIR (15). ... MS-MARCO in BeIR uses the dev split because there is no public test label. ... Cross-lingual retrieval is performed by using queries (6.6k per language) in different languages to recall relevant English Wikipedia passages from a 2.7M-passage corpus.
Hardware Specification | Yes | All training runs are performed on 8 NVIDIA H800 GPUs, taking 4.5 hours for tDRO and less than 10 hours for all LLM-DR fine-tunings.
Software Dependencies | No | The paper mentions software components and techniques such as gradient cache (Gao et al. 2021), flash attention 2 (Dao 2023), fully sharded data parallel (FSDP), activation checkpointing, and low-rank adapters (LoRA) (Hu et al. 2022), but does not provide specific version numbers for these or for other core software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | tDRO is performed with a total batch size of 2048, query and document maximum sequence lengths of 128 and 512, a proxy-model learning rate ηθ of 1e-4, a contrastive temperature τ of 0.002, a weights learning rate ηα of 2e-2, and a seed of 42. ... Contrastive learning is performed with the same batch size, sequence lengths, model learning rate (1e-4), and contrastive temperature as stated above.
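For reference on how the contrastive temperature τ = 0.002 enters the objective, the InfoNCE loss with in-batch negatives can be sketched as below. This is an illustrative dependency-free sketch: the function name, embedding shapes, and pure-Python math are assumptions, not the paper's implementation (which trains on GPUs with batch size 2048).

```python
import math

def info_nce_loss(query_embs, doc_embs, tau=0.002):
    """InfoNCE with in-batch negatives.

    query_embs[i] is assumed relevant to doc_embs[i]; every other
    document in the batch serves as a negative. tau is the contrastive
    temperature (0.002 in the paper's setup).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(a):
        n = math.sqrt(dot(a, a))
        return [x / n for x in a]

    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    losses = []
    for i, qi in enumerate(q):
        # Cosine similarities against all in-batch documents, scaled by tau.
        logits = [dot(qi, dj) / tau for dj in d]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy: negative log-probability of the positive pair.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 2 query/document pairs in 2-D embedding space.
loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
```

A very low temperature such as 0.002 sharpens the softmax, so even small similarity gaps between the positive and the in-batch negatives drive the loss close to zero.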