Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval
Authors: Guangyuan Ma, Yongliang Ma, Xing Wu, Zhenpeng Su, Ming Zhou, Songlin Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show optimal improvements on large-scale retrieval benchmarks and a reduction of up to 30% in dataset usage after applying our optimization algorithm to a series of different-sized LLM-DR models. |
| Researcher Affiliation | Collaboration | Guangyuan Ma1,2*, Yongliang Ma3, Xing Wu1,2, Zhenpeng Su1,2, Ming Zhou3, Songlin Hu1,2. 1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; 3Langboat Technology, Beijing, China. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the Task-level Distributionally Robust Optimization (tDRO) algorithm steps in narrative text under sections such as "Task-level Distributionally Robust Optimization", "InfoNCE Loss", "Optimization Objective", "Weights Update", "Relative Loss Measurement", and "Proxy Model Update", but does not present them in a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code https://github.com/tdro-llm/tdro |
| Open Datasets | Yes | Datasets https://huggingface.co/tdro-llm |
| Dataset Splits | Yes | Benchmark (# Dataset) MIRACL (18) MKQA (25) BeIR (15) ... MS-MARCO in BeIR uses the dev split because there is no public test label. ... cross-lingual retrieval is performed by using queries (6.6k for each language) of different languages to recall relevant English Wikipedia passages from a 2.7M-passage collection. |
| Hardware Specification | Yes | All trainings are performed on 8 NVIDIA H800 GPUs, taking 4.5 hours for tDRO and less than 10 hours for all LLM-DR fine-tunings. |
| Software Dependencies | No | The paper mentions software components and techniques like "Gradient cache (Gao et al. 2021), flash attention 2 (Dao 2023), full-shard data parallel (FSDP), activation checkpointing and low-rank adapter (LoRA) (Hu et al. 2022)" but does not provide specific version numbers for these or other core software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | tDRO is performed with a total batch size of 2048, query & document maximum sequence lengths of 128 & 512, a proxy model learning rate ηθ of 1e-4, contrastive temperature τ of 0.002, weights learning rate ηα of 2e-2, and seed of 42. ... Contrastive learning is performed with the same batch size, sequence lengths, model learning rate (1e-4), and contrastive temperature as stated before. |
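Since the paper gives only a narrative description of the algorithm (no pseudocode block), the core loop can be sketched roughly as follows. This is a minimal illustration, assuming the common mirror-ascent form of DRO: task weights grow exponentially with each task's relative loss (proxy-model loss minus a reference loss) and are then renormalized. All function and variable names here are illustrative, not from the paper; the hyperparameter values (τ = 0.002, ηα = 2e-2) are taken from the setup row above.

```python
import math

def infonce_loss(sim_pos, sim_negs, temperature=0.002):
    """InfoNCE loss for one query: negative log-softmax of the
    positive similarity against positive + negative similarities."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

def update_task_weights(weights, proxy_losses, ref_losses, lr_alpha=2e-2):
    """One exponentiated-gradient step on per-task sampling weights.

    The 'relative loss' (proxy minus reference) is an assumption standing
    in for the paper's relative loss measurement; weights are renormalized
    to remain a distribution over tasks.
    """
    rel = [p - r for p, r in zip(proxy_losses, ref_losses)]
    new_w = [w * math.exp(lr_alpha * g) for w, g in zip(weights, rel)]
    total = sum(new_w)
    return [w / total for w in new_w]
```

For example, starting from uniform weights over two tasks with proxy losses [1.2, 0.8] against reference losses [1.0, 1.0], the update shifts weight toward the first (harder) task while keeping the weights summing to 1.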