Descriptive and Discriminative Document Identifiers for Generative Retrieval

Authors: Jiehan Cheng, Zhicheng Dou, Yutao Zhu, Xiaoxi Li

AAAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our experimental results on the MS MARCO and NQ320k dataset illustrate the effectiveness of the approach." |
| Researcher Affiliation | Academia | Gaoling School of Artificial Intelligence, Renmin University of China (EMAIL, EMAIL, EMAIL) |
| Pseudocode | No | The paper describes the methodology using textual descriptions and mathematical equations, but it does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | "We experiment on two widely recognized datasets: MS MARCO (Bajaj et al. 2016) and Natural Questions (NQ) (Kwiatkowski et al. 2019)." |
| Dataset Splits | Yes | "Following NOVO (Wang et al. 2023), we eliminate duplicate documents in NQ based on document titles and use the training set and the validation set divided in NQ as our training set and testing set. ... and use the training set and the dev set divided in MS MARCO as our training set and testing set." |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using T5-base as the base model and nltk for n-gram processing, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | "On MS300k, we choose similarity threshold λ1 = 0.99, MRR threshold λ2 = 0.1 to improve the diversity of the synthetic queries so as to reflect the document from multiple perspectives, while on NQ320k, we set similarity threshold λ1 = 0.99, MRR threshold λ2 = 0.6 to improve the retrieval performance of the query. We choose the number of n-grams ng = 3 to compose the Doc IDs on both datasets." |
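As a rough illustration of the quoted setting (ng = 3 n-grams composing each Doc ID), the sketch below extracts candidate word n-grams from a document. This is a hypothetical stand-in, not the paper's released code: the paper's filtering by the similarity (λ1) and MRR (λ2) thresholds depends on trained models, so plain frequency is used here as the ranking criterion, and the `candidate_ngrams` helper name is an assumption.

```python
# Hypothetical sketch: picking n-grams that could compose a Doc ID.
# Frequency ranking substitutes for the paper's λ1/λ2 model-based filtering.
from collections import Counter


def candidate_ngrams(text, n=3, num_ids=3):
    """Return the num_ids most frequent word n-grams of the text."""
    tokens = text.lower().split()
    # Build all contiguous n-token windows.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return [gram for gram, _ in counts.most_common(num_ids)]
```

A real pipeline would tokenize with nltk (as the paper mentions) and score candidates against synthetic queries rather than raw counts.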