Augmenting Ad-Hoc IR Dataset for Interactive Conversational Search

Authors: Pierre Erbacher, Jian-Yun Nie, Philippe Preux, Laure Soulier

TMLR 2024

Reproducibility assessment (Variable: Result, followed by the supporting evidence from the LLM response)
Research Type: Experimental
Evidence: "We perform a thorough evaluation showing the quality and the relevance of the generated interactions for each initial query. This paper shows the feasibility and utility of augmenting ad-hoc IR datasets for conversational IR." (Abstract) ... "In this section, we evaluate our methodology, and particularly, the quality of simulated interactions." (Section 4.1.1 Datasets; Section 4.1.2 Baselines and metrics; Section 4.2 Evaluation of the generated interactions)
Researcher Affiliation: Academia
Evidence: Pierre Erbacher (Sorbonne Université), Jian-Yun Nie (Université de Montréal), Philippe Preux (Inria, Université de Lille), Laure Soulier (Sorbonne Université)
Pseudocode: Yes
Evidence: "Algorithm 1: Offline methodology for building Mixed-Initiative IR dataset"
Open Source Code: No
Evidence: "We will release, upon acceptance, the complete generated datasets as well as the clarifying model CM and the user simulation US to allow the generation of additional interactions."
Open Datasets: Yes
Evidence: "We focus here on the MS MARCO 2021 passages dataset (Nguyen et al., 2016)... To train the clarifying model CM, we use the filtered version of the ClariQ dataset proposed in (Sekulić et al., 2021)... to generate simulated interactions on a new Natural Questions dataset (Kwiatkowski et al., 2019)."
Dataset Splits: Yes
Evidence: "To build a final dataset including training and testing sets, we respectively apply the offline evaluation methodology (Algorithm 1) on the other half of the training set (not used to train the user simulation) and the online evaluation methodology (Figure 1) on the test set of the MS MARCO dataset." Train set: 500K queries; test set: 6,980 queries.
Hardware Specification: Yes
Evidence: "The model fine-tuning takes approximately 4 hours on 4 RTX 3080 (24 GB)."
Software Dependencies: No
Evidence: "For both CM and US, we used the pre-trained T5 checkpoint available on the Huggingface hub (Raffel et al., 2020; Wolf et al., 2019)... Keyword embeddings are computed using an off-the-shelf pre-trained MiniLM-L6-v2 model (Reimers & Gurevych, 2019b)... we perform a first-stage retrieval on the initial query using the pyserini (Lin et al., 2021) implementation of BM25." No specific version numbers for software libraries such as Python, PyTorch, or pyserini are provided.
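The paper's first-stage retrieval relies on pyserini's BM25 implementation. As a rough, self-contained illustration of the scoring BM25 performs (not the paper's actual pipeline: the function name, tokenization, and toy documents below are our own; k1=0.9 and b=0.4 mirror Anserini's defaults):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each term (number of docs containing it).
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "what is conversational search".split(),
    "ad hoc retrieval with bm25".split(),
    "bm25 ranks documents for a query using bm25 term statistics".split(),
]
scores = bm25_scores("bm25 query".split(), docs)
```

Documents sharing more (and rarer) query terms score higher; the third toy document matches both terms and ranks first.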
Experiment Setup: Yes
Evidence: "To finetune these two models, we used teacher forcing (Williams & Zipser, 1989) and a cross-entropy loss. For optimization, we use Adafactor (Shazeer & Stern, 2018), weight decay, and a learning rate of 5x10^-5 with a batch size of 64. ... The number of extracted words is fixed to k = 5 for the overall experiments. For inference, we use nucleus sampling (p=0.95) for the CM and US models. ... We fine-tune this model on our train set for 1 epoch, using our methodology with teacher forcing and a cross-entropy loss. We consider a maximum sequence length of 512 and a batch size of 128 sequences. ... For optimization, we use Adafactor (Shazeer & Stern, 2018), weight decay, and a learning rate of 10^-4."
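The setup quotes nucleus sampling with p=0.95 for inference with the CM and US models. A minimal pure-Python sketch of nucleus (top-p) sampling over a toy next-token distribution (the function name and probabilities are illustrative, not from the paper):

```python
import random

def nucleus_sample(probs, p=0.95, rng=random):
    """Draw a token index via nucleus (top-p) sampling.

    Keeps the smallest set of highest-probability tokens whose
    cumulative probability reaches p, renormalizes, then samples.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Sample proportionally within the truncated, renormalized nucleus.
    r = rng.random() * total
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]

# With p=0.5 only the two most likely tokens (indices 0 and 1) survive.
probs = [0.4, 0.3, 0.2, 0.1]
draws = {nucleus_sample(probs, p=0.5) for _ in range(1000)}
```

Truncating the tail before sampling is what makes nucleus sampling less prone to degenerate low-probability continuations than plain temperature sampling.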