Position: Retrieval-augmented systems can be dangerous medical communicators
Authors: Lionel Wong, Ayman Ali, Raymond M Xiong, Zejiang Shen, Yoon Kim, Monica Agrawal
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. |
| Researcher Affiliation | Academia | MIT CSAIL; Stanford University; Duke University. Correspondence to: Lionel Wong <EMAIL>, Monica Agrawal <EMAIL>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code, queries, and results can be found at: https://github.com/rayarxti/rag-medical-communicator/ |
| Open Datasets | Yes | The code, queries, and results can be found at: https://github.com/rayarxti/rag-medical-communicator/ |
| Dataset Splits | No | The paper describes a query study and evaluation methods for responses from existing RAG systems (Google AIO, Perplexity AI) and LLM-as-a-judge evaluations, but it does not specify training/test/validation splits for a dataset used to train a model developed by the authors. |
| Hardware Specification | No | The paper mentions using Google's AI Overview and Perplexity AI, and LLMs like GPT-4o for evaluation, but does not specify the hardware used by the authors to conduct their analysis or run these models. |
| Software Dependencies | No | The paper mentions Perplexity used 'llama-3.1-sonar-huge-128k-online' and that 'two different versions of GPT-4o' were used for LLM-as-a-judge. However, these are models being used, not the specific software dependencies (e.g., Python libraries with version numbers) for the authors' own implementation or analysis code. |
| Experiment Setup | Yes | We undertake a large-scale analysis of two major retrieval-augmented search engines, Google's AI Overview (Google AIO) and Perplexity AI. ... We design a set of procedurally generated queries (Table 1) ... We used LLM-as-a-judge to identify a predetermined set of misleading behaviors. ... For these LLM-as-a-judge evaluations, we leveraged two different versions of GPT-4o, spot-checked the labels for quality, and manually adjudicated low-confidence annotations where the two versions of GPT disagreed. |
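The adjudication protocol in the setup row — two GPT-4o judges, automatic acceptance on confident agreement, manual review on disagreement or low confidence — can be sketched as below. This is an illustrative reconstruction, not the authors' released code; the `JudgeLabel` type, label strings, and the 0.8 confidence threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class JudgeLabel:
    label: str         # e.g., "misleading" or "not_misleading" (hypothetical values)
    confidence: float  # judge confidence score in [0, 1]

def adjudicate(a: JudgeLabel, b: JudgeLabel, min_conf: float = 0.8):
    """Combine two LLM-as-a-judge annotations.

    Returns (final_label, needs_manual_review). Agreeing, high-confidence
    labels are accepted automatically; disagreements or low-confidence
    annotations are flagged for manual adjudication, mirroring the
    spot-check-and-adjudicate protocol described in the paper.
    """
    if a.label == b.label and min(a.confidence, b.confidence) >= min_conf:
        return a.label, False
    return None, True

# Both judge versions agree with high confidence: accept automatically.
print(adjudicate(JudgeLabel("misleading", 0.95), JudgeLabel("misleading", 0.90)))
# The judges disagree: route to a human adjudicator.
print(adjudicate(JudgeLabel("misleading", 0.95), JudgeLabel("not_misleading", 0.90)))
```

The key design point is that only the cheap, unambiguous cases are resolved automatically; every disagreement and every low-confidence pair costs a human label, which keeps the LLM judges from silently deciding the hard cases.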