POQD: Performance-Oriented Query Decomposer for Multi-Vector Retrieval

Authors: Yaoyang Liu, Junlin Li, Yinjun Wu, Zhen Chen

ICML 2025

Reproducibility assessment (Variable | Result | LLM Response):
Research Type | Experimental | "Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy. POQD is available at https://github.com/PKU-SDS-lab/POQD-ICML25." "We perform end-to-end RAG training on the QA datasets introduced in Section 5.1. For this experiment, we not only report the end-to-end QA accuracy in Table 2 but also compare the ground-truth relevant documents or images against the retrieved ones by POQD and baseline methods in Table 1."
Researcher Affiliation | Academia | "1School of Information, Renmin University 2School of Computer Science, Peking University 3Fundamental Industry Training Center, Tsinghua University. Correspondence to: Yinjun Wu <EMAIL>, Zhen Chen <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Optimize query decomposition"; "Algorithm 2: Training POQD"
Open Source Code | Yes | "POQD is available at https://github.com/PKU-SDS-lab/POQD-ICML25."
Open Datasets | Yes | "We employ Web Questions (Web QA) (Berant et al., 2013; Chang et al., 2021), Multi Modal QA (Talmor et al.), Many Modal QA (Hannan et al., 2020) and Strategy QA (Geva et al., 2021a) dataset for experiments. Among these datasets, the former three include questions requiring retrieval from multi-modal data."
Dataset Splits | No | The paper uses well-known benchmark datasets but does not explicitly describe the train/validation/test splits used for its experiments, nor does it refer to predefined splits with specific citations or file names. It mentions selecting questions from datasets but not how the datasets themselves were partitioned for training, validation, and testing.
Hardware Specification | No | The paper mentions "GPU Cluster support with AIBD platform from Fundamental Industry Training Center of Tsinghua University" but does not provide specific details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions various models such as "Sentence-Bert model (Reimers, 2019)", "CLIP model (Radford et al., 2021)", "Llama3.1-8B (Dubey et al., 2024)", "Llava-v1.5-7B (Liu et al., 2024)", "GPT-4 (Achiam et al., 2023)", and "RoBERTa model (Liu et al., 2019)". However, it does not specify versions for general software dependencies such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | "Throughout the experiments, the default values of α, τ and κ are configured as 0.02, 3 and 5, respectively. Regarding the configuration for the retrieval process, we retrieve the Top-1 most relevant images and the Top-2 most relevant documents in the image QA and text QA tasks, respectively."
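To make the Top-k retrieval setting above concrete, the following is a minimal sketch of multi-vector retrieval: a query is decomposed into sub-queries, each document is represented by multiple vectors, and relevance is aggregated with a MaxSim-style sum (standard in multi-vector retrievers such as ColBERT). This is an illustration only, not POQD's actual scoring or decomposition code; the function names, the cosine-similarity choice, and the MaxSim aggregation are assumptions.

```python
import numpy as np

def maxsim_score(sub_query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Score a (decomposed) query against one document.

    sub_query_vecs: (n_sub_queries, dim) embeddings of the sub-queries.
    doc_vecs:       (n_doc_vecs, dim) multi-vector document representation.
    Each sub-query contributes its best cosine match within the document.
    """
    q = sub_query_vecs / np.linalg.norm(sub_query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                      # (n_sub_queries, n_doc_vecs)
    return float(sims.max(axis=1).sum())  # MaxSim aggregation

def retrieve_top_k(sub_query_vecs: np.ndarray, corpus: list, k: int = 2) -> list:
    """Return indices of the Top-k highest-scoring documents in the corpus."""
    scores = [maxsim_score(sub_query_vecs, doc) for doc in corpus]
    return list(np.argsort(scores)[::-1][:k])
```

Under this sketch, the paper's configuration corresponds to calling `retrieve_top_k(..., k=1)` for image retrieval and `retrieve_top_k(..., k=2)` for document retrieval.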