Augmenting Ad-Hoc IR Dataset for Interactive Conversational Search
Authors: Pierre Erbacher, Jian-Yun Nie, Philippe Preux, Laure Soulier
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a thorough evaluation showing the quality and the relevance of the generated interactions for each initial query. This paper shows the feasibility and utility of augmenting ad-hoc IR datasets for conversational IR. (Abstract) ... In this section, we evaluate our methodology, and particularly, the quality of simulated interactions. ... Section 4.1.1 Datasets ... Section 4.1.2 Baselines and metrics ... Section 4.2 Evaluation of the generated interactions |
| Researcher Affiliation | Academia | Pierre Erbacher (Sorbonne Université), Jian-Yun Nie (Université de Montréal), Philippe Preux (Inria, Université de Lille), Laure Soulier (Sorbonne Université) |
| Pseudocode | Yes | Algorithm 1 Offline methodology for building Mixed-Initiative IR dataset |
| Open Source Code | No | We will release, upon acceptance, the complete generated datasets as well as the clarifying model CM and the user simulation US to allow the generation of additional interactions. |
| Open Datasets | Yes | We focus here on the MS MARCO 2021 passages dataset (Nguyen et al., 2016)... To train the clarifying model CM, we use the filtered version of the ClariQ dataset proposed in (Sekulić et al., 2021)... to generate simulated interactions on a new Natural Questions dataset (Kwiatkowski et al., 2019). |
| Dataset Splits | Yes | To build a final dataset including training and testing sets, we respectively apply the offline evaluation methodology (Algorithm 1) on the other half of the training set (not used to train the user simulation) and the online evaluation methodology (Figure 1) on the test set of the MS MARCO dataset. ... Train set: 500K queries ... Test set: 6,980 queries |
| Hardware Specification | Yes | The model fine-tuning takes approximately 4 hours on 4 RTX 3080 (24 GB). |
| Software Dependencies | No | For both CM and US, we used the pre-trained T5 checkpoint available on the Hugging Face hub (Raffel et al., 2020; Wolf et al., 2019)... Keyword embeddings are computed using an off-the-shelf pre-trained MiniLM-L6-v2 model (Reimers & Gurevych, 2019b)... we perform a first-stage retrieval on the initial query using the Pyserini (Lin et al., 2021) implementation of BM25. No specific version numbers for software libraries like Python, PyTorch, or Pyserini were provided. |
| Experiment Setup | Yes | To fine-tune these two models, we used teacher forcing (Williams & Zipser, 1989) and a cross-entropy loss. For optimization, we use Adafactor (Shazeer & Stern, 2018), weight decay, and a learning rate of 5×10⁻⁵ with a batch size of 64. ... The number of extracted words is fixed to k = 5 for the overall experiments. For inference, we use nucleus sampling (p=0.95) for the CM and US models. ... We fine-tune this model on our train set in 1 epoch, using our methodology with teacher forcing and a cross-entropy loss. We consider a maximum sequence length of 512 and a batch size of 128 sequences. ... For optimization, we use Adafactor (Shazeer & Stern, 2018), weight decay, and a learning rate of 10⁻⁴. |
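Since the paper's code was not released at review time, the nucleus sampling (p=0.95) used for CM and US inference can be illustrated with a minimal, dependency-free sketch. This is our own illustration, not the authors' implementation; the function name `nucleus_filter` and the toy logits are ours.

```python
import math

def nucleus_filter(logits, p=0.95):
    """Top-p (nucleus) filtering: keep the smallest set of tokens whose
    cumulative probability exceeds p, then renormalize over that set."""
    # Numerically stable softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Visit tokens in order of descending probability.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:  # nucleus covers at least mass p
            break
    # Renormalize so the kept tokens form a proper distribution to sample from.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy example: the least likely token falls outside the p=0.95 nucleus.
dist = nucleus_filter([2.0, 1.0, 0.1, -1.0], p=0.95)
print(sorted(dist))
```

In practice this filtering is applied at each decoding step of the T5-based CM and US models (e.g. via a generation library's `top_p` option), sampling the next token from the renormalized nucleus rather than the full vocabulary.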