Descriptive and Discriminative Document Identifiers for Generative Retrieval
Authors: Jiehan Cheng, Zhicheng Dou, Yutao Zhu, Xiaoxi Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on the MS MARCO and NQ320k datasets illustrate the effectiveness of the approach. |
| Researcher Affiliation | Academia | Gaoling School of Artificial Intelligence, Renmin University of China |
| Pseudocode | No | The paper describes the methodology using textual descriptions and mathematical equations, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We experiment on two widely recognized datasets: MS MARCO (Bajaj et al. 2016) and Natural Questions (NQ) (Kwiatkowski et al. 2019). |
| Dataset Splits | Yes | Following NOVO (Wang et al. 2023), we eliminate duplicate documents in NQ based on document titles and use the training set and the validation set divided in NQ as our training set and testing set. ... and use the training set and the dev set divided in MS MARCO as our training set and testing set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'T5-base' as the base model and 'nltk' for n-gram processing, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | On MS300k, we choose similarity threshold λ1 = 0.99, MRR threshold λ2 = 0.1 to improve the diversity of the synthetic queries so as to reflect the document from multiple perspectives, while on NQ320k, we set similarity threshold λ1 = 0.99, MRR threshold λ2 = 0.6 to improve the retrieval performance of the query. We choose the number of n-grams ng = 3 to compose the Doc IDs on both datasets. |
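
The reported splits and thresholds can be collected into a small configuration, which may help anyone attempting a reproduction. The sketch below is not the authors' code (none is released). It records the stated values of λ1, λ2, and ng per dataset, deduplicates documents by title as described for NQ, and filters synthetic queries. The names `SetupConfig`, `dedupe_by_title`, and `filter_queries`, the dictionary keys, and in particular the comparison directions (drop a query whose similarity reaches λ1, keep a query whose MRR is at least λ2) are assumptions made for illustration; the table does not state how the thresholds are applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    similarity_threshold: float  # lambda_1 as reported
    mrr_threshold: float         # lambda_2 as reported
    num_ngrams: int              # ng, number of n-grams composing a DocID


# Values copied from the Experiment Setup row.
CONFIGS = {
    "MS300k": SetupConfig(similarity_threshold=0.99, mrr_threshold=0.1, num_ngrams=3),
    "NQ320k": SetupConfig(similarity_threshold=0.99, mrr_threshold=0.6, num_ngrams=3),
}


def dedupe_by_title(docs):
    """Keep the first document seen for each title (hypothetical 'title' key)."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["title"] not in seen:
            seen.add(doc["title"])
            unique.append(doc)
    return unique


def filter_queries(candidates, cfg):
    """Filter synthetic queries with precomputed scores.

    ASSUMPTION: a candidate is dropped when its similarity to already
    accepted queries reaches lambda_1 (near-duplicate), or when its MRR
    falls below lambda_2. The source does not specify these semantics.
    """
    kept = []
    for cand in candidates:
        if cand["similarity"] >= cfg.similarity_threshold:
            continue  # assumed: too similar to an existing query
        if cand["mrr"] < cfg.mrr_threshold:
            continue  # assumed: query does not retrieve its document well enough
        kept.append(cand)
    return kept
```

Under these assumptions, the lower λ2 on MS300k admits more (and more varied) queries, while the higher λ2 on NQ320k keeps only queries that already retrieve their document well, which matches the rationale quoted in the table.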