Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval

Authors: Guangyuan Ma, Yongliang Ma, Xing Wu, Zhenpeng Su, Ming Zhou, Songlin Hu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show consistent improvements on large-scale retrieval benchmarks and up to a 30% reduction in dataset usage after applying the tDRO optimization algorithm across a series of different-sized LLM-DR models.
Researcher Affiliation | Collaboration | Guangyuan Ma (1,2,*), Yongliang Ma (3), Xing Wu (1,2), Zhenpeng Su (1,2), Ming Zhou (3), Songlin Hu (1,2). 1: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2: School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; 3: Langboat Technology, Beijing, China. EMAIL, EMAIL
Pseudocode | No | The paper describes the Task-level Distributionally Robust Optimization (tDRO) algorithm steps in narrative text under sections such as "Task-level Distributionally Robust Optimization", "InfoNCE Loss", "Optimization Objective", "Weights Update", "Relative Loss Measurement", and "Proxy Model Update", but does not present them in a structured pseudocode or algorithm block.
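Since the paper gives the algorithm only in narrative form, the weight-update step (relative loss measurement followed by an exponentiated-gradient update on task weights) can be sketched roughly as follows. This is a minimal illustration under assumptions: the function name, the ratio-style relative loss, and the mirror-ascent form of the update are not taken from the paper's code.

```python
import math

def tdro_weight_update(weights, proxy_losses, ref_losses, lr_alpha=2e-2):
    """Hedged sketch of a DRO-style task-weight update.

    weights      : current per-task sampling weights (sum to 1)
    proxy_losses : contrastive losses of the proxy model, one per task
    ref_losses   : losses of a frozen reference model, one per task
    lr_alpha     : weights learning rate (eta_alpha; 2e-2 in the paper)
    """
    # Relative loss measurement: how much worse the proxy model is
    # than the reference on each task (illustrative ratio form).
    rel = [p / max(r, 1e-12) for p, r in zip(proxy_losses, ref_losses)]
    # Exponentiated-gradient (mirror ascent) step: harder tasks,
    # i.e. those with larger relative loss, get larger weights.
    logits = [math.log(w) + lr_alpha * g for w, g in zip(weights, rel)]
    # Renormalize to a probability distribution over tasks.
    m = max(logits)
    exp = [math.exp(x - m) for x in logits]
    z = sum(exp)
    return [e / z for e in exp]

# Example: task 0 has the higher relative loss, so its weight grows.
new_w = tdro_weight_update([0.5, 0.5], [1.2, 0.8], [1.0, 1.0])
```

The renormalization keeps the weights on the simplex, so they can be reused directly as per-task sampling probabilities in the next data-selection round.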
Open Source Code | Yes | Code: https://github.com/tdro-llm/tdro
Open Datasets | Yes | Datasets: https://huggingface.co/tdro-llm
Dataset Splits | Yes | Benchmarks (# datasets): MIRACL (18), MKQA (25), BeIR (15). ... MS-MARCO in BeIR uses the dev split because there is no public test label. ... Cross-lingual retrieval is performed by using queries (6.6k per language) in different languages to recall relevant English Wikipedia passages from a 2.7M-passage corpus.
Hardware Specification | Yes | All training runs are performed on 8 NVIDIA H800 GPUs, taking 4.5 hours for tDRO and less than 10 hours for all LLM-DR fine-tunings.
Software Dependencies | No | The paper mentions software components and techniques such as gradient cache (Gao et al. 2021), flash attention 2 (Dao 2023), fully sharded data parallel (FSDP), activation checkpointing, and low-rank adapters (LoRA) (Hu et al. 2022), but does not provide specific version numbers for these or for other core software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | tDRO is performed with a total batch size of 2048, query and document maximum sequence lengths of 128 and 512, a proxy-model learning rate ηθ of 1e-4, a contrastive temperature τ of 0.002, a weights learning rate ηα of 2e-2, and a seed of 42. ... Contrastive learning is performed with the same batch size, sequence lengths, model learning rate (1e-4), and contrastive temperature as stated above.
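For reference on how the contrastive temperature τ = 0.002 enters the objective, the InfoNCE loss with in-batch negatives can be sketched as below. This is an illustrative dependency-free sketch: the function name, embedding shapes, and pure-Python math are assumptions, not the paper's implementation (which trains on GPUs with batch size 2048).

```python
import math

def info_nce_loss(query_embs, doc_embs, tau=0.002):
    """InfoNCE with in-batch negatives.

    query_embs[i] is assumed relevant to doc_embs[i]; every other
    document in the batch serves as a negative. tau is the contrastive
    temperature (0.002 in the paper's setup).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(a):
        n = math.sqrt(dot(a, a))
        return [x / n for x in a]

    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    losses = []
    for i, qi in enumerate(q):
        # Cosine similarities against all in-batch documents, scaled by tau.
        logits = [dot(qi, dj) / tau for dj in d]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy: negative log-probability of the positive pair.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 2 query/document pairs in 2-D embedding space.
loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
```

A very low temperature such as 0.002 sharpens the softmax, so even small similarity gaps between the positive and the in-batch negatives drive the loss close to zero.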