Position: Retrieval-augmented systems can be dangerous medical communicators

Authors: Lionel Wong, Ayman Ali, Raymond M Xiong, Zejiang Shen, Yoon Kim, Monica Agrawal

ICML 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
LLM Response: "Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems."
Researcher Affiliation: Academia
LLM Response: "MIT CSAIL, Stanford University, Duke University. Correspondence to: Lionel Wong <EMAIL>, Monica Agrawal <EMAIL>."
Pseudocode: No
LLM Response: "The paper does not contain any structured pseudocode or algorithm blocks."
Open Source Code: Yes
LLM Response: "The code, queries, and results can be found at: https://github.com/rayarxti/rag-medical-communicator/"
Open Datasets: Yes
LLM Response: "The code, queries, and results can be found at: https://github.com/rayarxti/rag-medical-communicator/"
Dataset Splits: No
LLM Response: "The paper describes a query study and evaluation of responses from existing RAG systems (Google AIO, Perplexity AI) using LLM-as-a-judge, but it does not specify training/validation/test splits, since the authors do not train a model of their own."
Hardware Specification: No
LLM Response: "The paper mentions using Google's AI Overview, Perplexity AI, and LLMs such as GPT-4o for evaluation, but does not specify the hardware used by the authors to conduct their analysis or run these models."
Software Dependencies: No
LLM Response: "The paper notes that Perplexity used 'llama-3.1-sonar-huge-128k-online' and that 'two different versions of GPT-4o' were used for LLM-as-a-judge. However, these are models being evaluated or used as judges, not the specific software dependencies (e.g., Python libraries with version numbers) of the authors' own analysis code."
Experiment Setup: Yes
LLM Response: "We undertake a large-scale analysis of two major retrieval-augmented search engines, Google's AI Overview (Google AIO) and Perplexity AI. ... We design a set of procedurally generated queries (Table 1) ... We used LLM-as-a-judge to identify a predetermined set of misleading behaviors. ... For these LLM-as-a-judge evaluations, we leveraged two different versions of GPT-4o, spot-checked the labels for quality, and manually adjudicated low-confidence annotations where the two versions of GPT disagreed."
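The dual-judge adjudication step described in that excerpt (accept labels on which the two GPT-4o judges agree, route disagreements to manual review) can be sketched as follows. This is an illustrative sketch only; the function name, query IDs, and label values are assumptions, not taken from the paper's released code.

```python
def adjudicate(labels_judge_a, labels_judge_b):
    """Merge two judges' labels per query: accept agreements,
    flag disagreements for manual adjudication.

    Hypothetical helper; both inputs map query ID -> label.
    """
    accepted = {}
    needs_review = []
    for qid, label_a in labels_judge_a.items():
        label_b = labels_judge_b[qid]
        if label_a == label_b:
            # The two judges agree: accept the shared label.
            accepted[qid] = label_a
        else:
            # The judges disagree: defer to a manual pass.
            needs_review.append(qid)
    return accepted, needs_review

# Example with three queries labeled by two judges:
a = {"q1": "misleading", "q2": "ok", "q3": "ok"}
b = {"q1": "misleading", "q2": "misleading", "q3": "ok"}
accepted, review = adjudicate(a, b)
# accepted == {"q1": "misleading", "q3": "ok"}; review == ["q2"]
```

In practice the manual pass would resolve each flagged query and merge the human decision back into the accepted set.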