Assessing biomedical knowledge robustness in large language models by query-efficient sampling attacks
Authors: Rui Patrick Xian, Alex Jihun Lee, Satvik Lolla, Vincent Wang, Russell Ro, Qiming Cui, Reza Abbasi-Asl
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examined the use of type-consistent entity substitution as a template for collecting adversarial entities for medium-sized billion-parameter LLMs with biomedical knowledge. To this end, we developed an embedding-space, gradient-free attack based on power-scaled distance-weighted sampling for robustness evaluation, which has a low query budget and controllable coverage. Our method has favorable query efficiency and scaling over alternative approaches based on black-box gradient-guided search, which we demonstrated for adversarial distractor generation in biomedical question answering. Subsequent failure mode analysis uncovered two regimes of adversarial entities on the attack surface with distinct characteristics. We also showed that entity substitution attacks can manipulate token-wise Shapley value explanations, which become deceptive in this setting. |
| Researcher Affiliation | Academia | R. Patrick Xian¹, Alex J. Lee¹, Satvik Lolla², Vincent Wang¹, Russell Ro²·¹, Qiming Cui²·¹, Reza Abbasi-Asl¹ (¹UC San Francisco, ²UC Berkeley) EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: TCES attack template for collecting adversarial entities in distractors. Inputs: question Q, LLM, entity type τ, budget B. Internals: number of choices N_ch, text t, entity Ent, embedding Emb, key (correct answer) k. Outputs: number of queries and the replacement entity; None if unsuccessful after all attempts.<br>TCESAttacker(Q, Model, τ, B)<br>&nbsp;&nbsp;t_choices, k ← Q.choices, Q.answer<br>&nbsp;&nbsp;for c ← 1 to N_ch do<br>&nbsp;&nbsp;&nbsp;&nbsp;t_ch ← t_choices[c]<br>&nbsp;&nbsp;&nbsp;&nbsp;Ent_ch ← NERecognizer(t_ch)<br>&nbsp;&nbsp;&nbsp;&nbsp;Ent_tf,ch ← TypeFilter(Ent_ch, τ)<br>&nbsp;&nbsp;&nbsp;&nbsp;Ent_key, Ent_distr,c ← SplitByLabel(Ent_tf,ch)<br>&nbsp;&nbsp;Ent_victim ← RankSelect(Ent_key, Ent_distr,c, Emb)<br>&nbsp;&nbsp;i ← 0<br>&nbsp;&nbsp;while i < B do<br>&nbsp;&nbsp;&nbsp;&nbsp;Ent_perturb ← Sampler(Ent_key, Ent_victim, Ent_vocab, Emb)<br>&nbsp;&nbsp;&nbsp;&nbsp;Q′ ← Q(Ent_victim → Ent_perturb)<br>&nbsp;&nbsp;&nbsp;&nbsp;k′ ← Model(Q′)<br>&nbsp;&nbsp;&nbsp;&nbsp;if GoalFunc(k′, k) = 1 then return i, Ent_perturb<br>&nbsp;&nbsp;&nbsp;&nbsp;Ent_vocab ← Ent_vocab \ Ent_perturb<br>&nbsp;&nbsp;&nbsp;&nbsp;i ← i + 1<br>&nbsp;&nbsp;return None |
| Open Source Code | Yes | The code developed and datasets used for the work are available at https://github.com/RealPolitiX/qstab. |
| Open Datasets | Yes | The code developed and datasets used for the work are available at https://github.com/RealPolitiX/qstab. ... We sourced vocabulary datasets of drug and disease names from existing public databases. The drug names dataset (FDA-drugs) contains over 2.3k unique entities from known drugs approved by the United States Food and Drug Administration (FDA) and curated by DrugCentral (Ursu et al., 2017). ... The disease names dataset (CTD-diseases) contains over 9.8k unique entities from the Comparative Toxicogenomics Database (Davis et al., 2009)... Biomedical QA datasets: We selected over 9.3k questions from the MedQA-USMLE (Jin et al., 2021) dataset and over 3.8k questions from the MedMCQA (Pal et al., 2022) dataset for benchmarking. ... Both datasets are publicly available and don't contain personal information. |
| Dataset Splits | No | The paper mentions using the MedQA-USMLE and MedMCQA datasets for benchmarking, stating: "We selected over 9.3k questions from the MedQA-USMLE (Jin et al., 2021) dataset and over 3.8k questions from the MedMCQA (Pal et al., 2022) dataset for benchmarking." However, it does not provide specific training/test/validation splits or describe how the data was partitioned for the experiments. |
| Hardware Specification | No | The paper mentions "multi-GPU split-model inference" but does not specify the GPU models or any other hardware components used for running the experiments. It also states "Model inference of Palmyra-Med-20B (Kamble & Alshikh, 2023) used 4-bit quantization to improve speed," but this describes a technique, not a hardware specification. |
| Software Dependencies | No | The paper mentions several software components, frameworks, and models such as "scispaCy (Neumann et al., 2019)", the "textattack framework (Morris et al., 2020b)", "SentenceTransformer (Reimers & Gurevych, 2019) with the RoBERTa-large model (all-roberta-large-v1)", "CODER (Yuan et al., 2022)", and "GTE-base (Li et al., 2023b)". However, it does not provide version numbers for these software libraries or frameworks, which are required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | All models were evaluated at zero temperature or in the non-sampling setting, and model inference was conducted in the zero-shot setting with only basic prompt instructions (see the prompt structure in Appendix D). ... We used fixed query budgets (B) for three main types of attacks: (i) single-query sampling-based attacks were used as the reference because the DiscreteZOO attack requires a minimum of 3 model queries; (ii) multi-query attacks used a budget of 8 for reasonable computational cost across all models and attack settings for both sampling- and search-based attacks; (iii) the query scaling trends of specific LLMs and different attack settings were investigated with a series of query budgets under 100 per input instance. ... For single-query PDWS attacks, we tuned the hyperparameter n within the interval of [-50, 50] using grid search with a step of 5 or 10 because the local maxima of ASR appear on the positive and negative sides. For multi-query PDWS attacks, hyperparameter n was briefly re-tuned around its optimal value in the single-query attack. Most tuned hyperparameters fall within [-30, -5] and [5, 30]. |
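The power-scaled distance-weighted sampling (PDWS) the paper describes can be illustrated with a minimal sketch: candidate substitutes are drawn with probability proportional to their embedding distance to the victim entity raised to the power n, so positive n biases sampling toward dissimilar entities and negative n toward similar ones. The function name `pdws_sample`, the Euclidean distance metric, and the interface below are illustrative assumptions, not the released qstab implementation.

```python
import numpy as np

def pdws_sample(victim_emb, vocab_embs, n, rng=None):
    """Draw one candidate substitute index from the vocabulary.

    victim_emb: (d,) embedding of the victim entity (assumed name)
    vocab_embs: (V, d) embeddings of the entity vocabulary
    n:          power-scaling hyperparameter; sign controls whether
                far (n > 0) or near (n < 0) entities are favored
    """
    if rng is None:
        rng = np.random.default_rng()
    # Distance of every vocabulary entity to the victim entity.
    dists = np.linalg.norm(vocab_embs - victim_emb, axis=1)
    # Power-scaled weights, normalized into a sampling distribution.
    weights = dists ** n
    probs = weights / weights.sum()
    return rng.choice(len(vocab_embs), p=probs)
```

With a large |n| the distribution concentrates on the farthest (n > 0) or nearest (n < 0) entity, which matches the paper's observation that the tuned optima sit at moderate magnitudes such as [-30, -5] and [5, 30], balancing coverage against concentration.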
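The query-budget loop of Algorithm 1 can also be condensed into a short Python sketch. This assumes a simplified `question` object whose victim entity has already been identified by the NER and type-filtering steps, a `model` callable returning a predicted answer key, and a `sampler` drawing type-consistent replacements; all names are hypothetical stand-ins, not the qstab API.

```python
def tces_attack(question, model, sampler, vocab, budget):
    """Sketch of the TCES attack loop (Algorithm 1), simplified.

    question: object with .text, .answer, and a pre-identified .victim entity
    model:    callable mapping question text to a predicted answer key
    sampler:  callable drawing a type-consistent replacement from the pool
    Returns (queries_used, adversarial_entity) on success, else None.
    """
    pool = set(vocab)
    for i in range(budget):
        candidate = sampler(question.victim, pool)
        # Substitute the victim entity with the sampled candidate.
        perturbed = question.text.replace(question.victim, candidate)
        # Goal function: the attack succeeds if the model's answer flips.
        if model(perturbed) != question.answer:
            return i, candidate
        pool.discard(candidate)  # sample without replacement across queries
    return None  # budget exhausted without flipping the answer
```

Removing each failed candidate from the pool mirrors the `Ent_vocab \ Ent_perturb` step in the pseudocode, which keeps the attack within its fixed query budget without retrying entities.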