DOGR: Leveraging Document-Oriented Contrastive Learning in Generative Retrieval
Authors: Penghao Lu, Xin Dong, Yuansheng Zhou, Lei Cheng, Chuan Yuan, Linjian Mo
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that DOGR achieves state-of-the-art performance compared to existing generative retrieval methods on two public benchmark datasets. Further experiments have shown that our framework is generally effective for common identifier construction techniques. |
| Researcher Affiliation | Industry | Ant Group |
| Pseudocode | No | The paper describes the proposed scheme and two-stage learning strategy in paragraph form and through a visual diagram (Figure 1), but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "Our experimental code is implemented on Python 3.8 using transformers 4.37.0, while experiments are conducted on 6 NVIDIA A100 GPUs with 80 GB of memory." This describes the implementation environment but does not explicitly state that the code for the methodology is being released or provide a link to a repository. |
| Open Datasets | Yes | Natural Questions (NQ320k) (Kwiatkowski et al. 2019) contains 320k training data (relevant query-document pairs), 100k documents, and 7,830 test queries... MS MARCO passage ranking (MS MARCO) (Nguyen et al. 2016) is a large-scale benchmark dataset that includes 8.8 million passages collected from Bing search results and 1 million real-world queries, with the test set containing 6,980 queries. |
| Dataset Splits | Yes | Natural Questions (NQ320k) (Kwiatkowski et al. 2019) contains 320k training data (relevant query-document pairs), 100k documents, and 7,830 test queries... We follow the same setup as previous work (Lee et al. 2023) and split the test set into two subsets: seen test and unseen test. |
| Hardware Specification | Yes | experiments are conducted on 6 NVIDIA A100 GPUs with 80 GB of memory. |
| Software Dependencies | Yes | Our experimental code is implemented on Python 3.8 using transformers 4.37.0 |
| Experiment Setup | Yes | In the training phase, batch sizes are set to 256 and 32, and the model is optimized for up to 3M and 1M steps using the Adam optimizer with a learning rate of 5e-5, for the identifier generation stage and the document-level ranking stage, respectively. The number of negatives from retrieval-augmented negative sampling is set to 4 per query, while the prefix-oriented negative sampling employs in-batch negatives. In the document ranking stage, τ is set to 0.5 as the temperature parameter for contrastive learning, and λg is set to 0.1 to balance the generative task and the contrastive learning task. In the inference phase, we use beam search with constrained decoding and set the beam size to 100. |
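The setup row above names two hyperparameters whose roles may not be obvious: the contrastive temperature τ = 0.5 and the weight λg = 0.1 that balances the generative and contrastive objectives. A minimal pure-Python sketch of how such a temperature-scaled (InfoNCE-style) contrastive loss and the weighted combination typically fit together is below; the similarity values are hypothetical, and which of the two terms λg multiplies is an assumption, since the paper only says it "balances" the tasks.

```python
import math

def contrastive_loss(pos_sim, neg_sims, tau=0.5):
    """InfoNCE-style loss: negative log-softmax of the positive
    query-document similarity over the positive plus the sampled
    negatives, with all similarities scaled by temperature tau."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[0] - log_norm)

def combined_loss(gen_loss, cl_loss, lambda_g=0.1):
    # Assumption: lambda_g down-weights the generative term against
    # the contrastive term; the paper does not specify the exact form.
    return lambda_g * gen_loss + cl_loss

# Toy example: one positive and 4 negatives per query, matching the
# "4 negatives from retrieval-augmented sampling" setting above.
cl = contrastive_loss(pos_sim=0.9, neg_sims=[0.2, 0.1, 0.05, 0.0], tau=0.5)
total = combined_loss(gen_loss=2.3, cl_loss=cl, lambda_g=0.1)
```

A lower τ sharpens the softmax, so the loss concentrates on the hardest negatives; at τ = 0.5 the distribution is only mildly sharpened relative to τ = 1.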