Augmenting Ad-Hoc IR Dataset for Interactive Conversational Search

Authors: Pierre Erbacher, Jian-Yun Nie, Philippe Preux, Laure Soulier

TMLR 2024

Reproducibility assessment (Variable: Result, followed by the supporting evidence from the LLM response)
Research Type: Experimental
Evidence: "We perform a thorough evaluation showing the quality and the relevance of the generated interactions for each initial query. This paper shows the feasibility and utility of augmenting ad-hoc IR datasets for conversational IR." (Abstract) ... "In this section, we evaluate our methodology, and particularly, the quality of simulated interactions." (Section 4.1.1 Datasets; Section 4.1.2 Baselines and metrics; Section 4.2 Evaluation of the generated interactions)
Researcher Affiliation: Academia
Evidence: Pierre Erbacher (Sorbonne Université), Jian-Yun Nie (Université de Montréal), Philippe Preux (Inria, Université de Lille), Laure Soulier (Sorbonne Université)
Pseudocode: Yes
Evidence: "Algorithm 1: Offline methodology for building Mixed-Initiative IR dataset"
Open Source Code: No
Evidence: "We will release, upon acceptance, the complete generated datasets as well as the clarifying model CM and the user simulation US to allow the generation of additional interactions."
Open Datasets: Yes
Evidence: "We focus here on the MS MARCO 2021 passages dataset (Nguyen et al., 2016)... To train the clarifying model CM, we use the filtered version of the ClariQ dataset proposed in (Sekulić et al., 2021)... to generate simulated interactions on a new Natural Questions dataset (Kwiatkowski et al., 2019)."
Dataset Splits: Yes
Evidence: "To build a final dataset including training and testing sets, we respectively apply the offline evaluation methodology (Algorithm 1) on the other half of the training set (not used to train the user simulation) and the online evaluation methodology (Figure 1) on the test set of the MS MARCO dataset." Train set: 500K queries; test set: 6,980 queries.
Hardware Specification: Yes
Evidence: "The model fine-tuning takes approximately 4 hours on 4 RTX 3080 (24 GB)."
Software Dependencies: No
Evidence: "For both CM and US, we used the pre-trained T5 checkpoint available on the Huggingface hub (Raffel et al., 2020; Wolf et al., 2019)... Keyword embeddings are computed using an off-the-shelf pre-trained MiniLM-L6-v2 model (Reimers & Gurevych, 2019b)... we perform a first-stage retrieval on the initial query using the pyserini (Lin et al., 2021) implementation of BM25." No specific version numbers for software libraries such as Python, PyTorch, or pyserini are provided.
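The paper's first-stage retrieval relies on pyserini's BM25 implementation. As a rough, self-contained illustration of the scoring BM25 performs (not the paper's actual pipeline: the function name, tokenization, and toy documents below are our own; k1=0.9 and b=0.4 mirror Anserini's defaults):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each term (number of docs containing it).
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "what is conversational search".split(),
    "ad hoc retrieval with bm25".split(),
    "bm25 ranks documents for a query using bm25 term statistics".split(),
]
scores = bm25_scores("bm25 query".split(), docs)
```

Documents sharing more (and rarer) query terms score higher; the third toy document matches both terms and ranks first.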
Experiment Setup: Yes
Evidence: "To finetune these two models, we used teacher forcing (Williams & Zipser, 1989) and a cross-entropy loss. For optimization, we use Adafactor (Shazeer & Stern, 2018), weight decay, and a learning rate of 5x10^-5 with a batch size of 64. ... The number of extracted words is fixed to k = 5 for the overall experiments. For inference, we use nucleus sampling (p=0.95) for the CM and US models. ... We fine-tune this model on our train set for 1 epoch, using our methodology with teacher forcing and a cross-entropy loss. We consider a maximum sequence length of 512 and a batch size of 128 sequences. ... For optimization, we use Adafactor (Shazeer & Stern, 2018), weight decay, and a learning rate of 10^-4."
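The setup quotes nucleus sampling with p=0.95 for inference with the CM and US models. A minimal pure-Python sketch of nucleus (top-p) sampling over a toy next-token distribution (the function name and probabilities are illustrative, not from the paper):

```python
import random

def nucleus_sample(probs, p=0.95, rng=random):
    """Draw a token index via nucleus (top-p) sampling.

    Keeps the smallest set of highest-probability tokens whose
    cumulative probability reaches p, renormalizes, then samples.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Sample proportionally within the truncated, renormalized nucleus.
    r = rng.random() * total
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]

# With p=0.5 only the two most likely tokens (indices 0 and 1) survive.
probs = [0.4, 0.3, 0.2, 0.1]
draws = {nucleus_sample(probs, p=0.5) for _ in range(1000)}
```

Truncating the tail before sampling is what makes nucleus sampling less prone to degenerate low-probability continuations than plain temperature sampling.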