Position: Retrieval-augmented systems can be dangerous medical communicators

Authors: Lionel Wong, Ayman Ali, Raymond M Xiong, Zejiang Shen, Yoon Kim, Monica Agrawal

ICML 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
LLM Response: "Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems."
Researcher Affiliation: Academia
LLM Response: "MIT CSAIL, Stanford University, Duke University. Correspondence to: Lionel Wong <EMAIL>, Monica Agrawal <EMAIL>."
Pseudocode: No
LLM Response: "The paper does not contain any structured pseudocode or algorithm blocks."
Open Source Code: Yes
LLM Response: "The code, queries, and results can be found at: https://github.com/rayarxti/rag-medical-communicator/"
Open Datasets: Yes
LLM Response: "The code, queries, and results can be found at: https://github.com/rayarxti/rag-medical-communicator/"
Dataset Splits: No
LLM Response: "The paper describes a query study and evaluation of responses from existing RAG systems (Google AIO, Perplexity AI) using LLM-as-a-judge, but it does not specify training/validation/test splits, since the authors do not train a model of their own."
Hardware Specification: No
LLM Response: "The paper mentions using Google's AI Overview, Perplexity AI, and LLMs such as GPT-4o for evaluation, but does not specify the hardware used by the authors to conduct their analysis or run these models."
Software Dependencies: No
LLM Response: "The paper notes that Perplexity used 'llama-3.1-sonar-huge-128k-online' and that 'two different versions of GPT-4o' were used for LLM-as-a-judge. However, these are models being evaluated or used as judges, not the specific software dependencies (e.g., Python libraries with version numbers) of the authors' own analysis code."
Experiment Setup: Yes
LLM Response: "We undertake a large-scale analysis of two major retrieval-augmented search engines, Google's AI Overview (Google AIO) and Perplexity AI. ... We design a set of procedurally generated queries (Table 1) ... We used LLM-as-a-judge to identify a predetermined set of misleading behaviors. ... For these LLM-as-a-judge evaluations, we leveraged two different versions of GPT-4o, spot-checked the labels for quality, and manually adjudicated low-confidence annotations where the two versions of GPT disagreed."
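The dual-judge adjudication step described in that excerpt (accept labels on which the two GPT-4o judges agree, route disagreements to manual review) can be sketched as follows. This is an illustrative sketch only; the function name, query IDs, and label values are assumptions, not taken from the paper's released code.

```python
def adjudicate(labels_judge_a, labels_judge_b):
    """Merge two judges' labels per query: accept agreements,
    flag disagreements for manual adjudication.

    Hypothetical helper; both inputs map query ID -> label.
    """
    accepted = {}
    needs_review = []
    for qid, label_a in labels_judge_a.items():
        label_b = labels_judge_b[qid]
        if label_a == label_b:
            # The two judges agree: accept the shared label.
            accepted[qid] = label_a
        else:
            # The judges disagree: defer to a manual pass.
            needs_review.append(qid)
    return accepted, needs_review

# Example with three queries labeled by two judges:
a = {"q1": "misleading", "q2": "ok", "q3": "ok"}
b = {"q1": "misleading", "q2": "misleading", "q3": "ok"}
accepted, review = adjudicate(a, b)
# accepted == {"q1": "misleading", "q3": "ok"}; review == ["q2"]
```

In practice the manual pass would resolve each flagged query and merge the human decision back into the accepted set.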