Confidence Elicitation: A New Attack Vector for Large Language Models
Authors: Brian Formento, Chuan Sheng Foo, See-Kiong Ng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that the elicited confidence is calibrated and not hallucinated for current LLMs. By minimizing the elicited confidence, we can therefore increase the likelihood of misclassification. Our new proposed paradigm demonstrates promising state-of-the-art results on three datasets across two models (LLaMA-3-8B-Instruct and Mistral-7B-Instruct-V0.3) when comparing our technique to existing hard-label black-box attack methods that introduce word-level substitutions. We conducted our confidence elicitation attacks on Meta-Llama-3-8B-Instruct (Touvron et al., 2023) and Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) while performing classification on two common datasets used to evaluate adversarial robustness, SST-2 and AG-News, and one modern dataset, StrategyQA (Geva et al., 2021). We utilize the evaluation framework previously proposed in (Morris et al., 2020), where an evaluation set is perturbed and the following metrics are recorded: clean accuracy (CA), accuracy under attack (AUA), attack success rate (ASR), and semantic similarity (SemSim) based on the Universal Sentence Encoder (Cer et al., 2018). We compare the original perplexity with the perturbed sample's perplexity. We also record Queries, subdivided into two categories: All Att Queries Avg and Succ Att Queries Avg. Additionally, we track Total Attack Time. We compare our guided word-substitution attack, CEAttack, to Self-Fool Word Sub from (Xu et al., 2024), TextHoaxer (Ye et al., 2022), and SSPAttack from (Liu et al., 2023). |
| Researcher Affiliation | Academia | 1 Institute of Data Science, National University of Singapore; 2 Institute for Infocomm Research, A*STAR; 3 Centre for Frontier AI Research, A*STAR |
| Pseudocode | Yes | Algorithm 1: Confidence elicitation attack. Input: initial input x, prompt for perturbation, vocabulary V = {τ1, τ2, ..., τV}, original sample confidence py. Output: predicted class ŷ, and adversarial sample xadv if conditions are met. 1: initialize generator function gδ and input x; 2: while number of queries not exceeded do; 3: repeat; 4: xadv ← output from gδ; 5: until dϵ(xadv) is true; 6: compute prediction and confidence ŷ, pC ← fθ(xadv) // assumes fθ returns prediction and calibrated confidence; 7: if pC < py then; 8: substitute x with xadv; 9: py ← pC // update previous confidence to current; 10: else if pC ≥ py then; 11: perform an alternative perturbation or adjustment; 12: x ← some function of xadv that modifies or reverts changes; 13: py ← pC // optionally update the reference confidence |
| Open Source Code | Yes | The code is publicly released in a GitHub repository: Confidence Elicitation Attacks. |
| Open Datasets | Yes | We conducted our confidence elicitation attacks on Meta-Llama-3-8B-Instruct (Touvron et al., 2023) and Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) while performing classification on two common datasets used to evaluate adversarial robustness, SST-2 and AG-News, and one modern dataset, StrategyQA (Geva et al., 2021). |
| Dataset Splits | No | We utilize the evaluation framework previously proposed in (Morris et al., 2020), where an evaluation set is perturbed and the following metrics are recorded: clean accuracy (CA), accuracy under attack (AUA), attack success rate (ASR), and semantic similarity (SemSim) based on the Universal Sentence Encoder (Cer et al., 2018). The paper refers to using an "evaluation set" within an existing framework but does not explicitly provide the specific training/validation/test splits (e.g., percentages or sample counts) for the datasets used in its experiments. |
| Hardware Specification | Yes | We use 12 Nvidia A40 GPUs for our testing, though each test can be conducted on a single A40 GPU. For our tests we perturb 500 samples on one A40 GPU with 46 GB of memory. |
| Software Dependencies | No | The paper mentions using the Meta-Llama-3-8B-Instruct (Touvron et al., 2023) and Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) models, a GPT-2 model for perplexity calculation, and counter-fitted embeddings (Mrkšić et al., 2016). However, it does not explicitly list version numbers for any underlying software libraries or programming languages (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | Prompting: From Figure 1, we first initialize the model using a two-step prompting strategy. This strategy consists of an initial k-guess query to the model, yielding k predicted guesses (kpred), followed by a second query to the model to obtain verbalized confidence levels for these k guesses (kconf). The confidence levels used are "Highest", "High", "Medium", "Low", and "Lowest". This technique has been demonstrated to be effective in previous work (Tian et al., 2023; Lin et al., 2022). In our experiments we set k to 20 for SST-2 and AG-News, and k to 6 for StrategyQA. Model-wise, previous confidence elicitation works (Xiong et al., 2024) use model settings commonly found in generative tasks, where the model samples from the top-k (top-k = 40) most probable next tokens, applies top-p (p = 0.92) nucleus sampling, which only considers next tokens with high probability, and uses a temperature setting of τ = 0.7. This setup naturally introduces some randomness to the model, whose behavior is still not fully understood in an adversarial setting, as highlighted in previous work (Huang et al., 2023; Zhao et al., 2024; Russinovich et al., 2024). We keep all the settings consistent with previous work, but set τ = 0, which follows previous work related to adversarial evaluation (Xu et al., 2024). Adversarial setup: Given a sample x, we first extract an ordered subset of words W ⊆ x to perturb randomly, with the size of W capped at \|W\| = 5. For each word w ∈ W, we obtain a set S of synonyms sourced from counter-fitted embeddings (Mrkšić et al., 2016). |
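The attack loop of Algorithm 1 combined with the adversarial setup above amounts to a greedy search that keeps a word substitution whenever the elicited confidence in the original label drops, and stops on a label flip. The sketch below is a minimal, hypothetical illustration, not the authors' released code: `predict` stands in for the paper's two-step LLM prompting (returning a label and a calibrated confidence), and `SYNONYMS` stands in for the counter-fitted embedding neighborhoods.

```python
import random

# Hypothetical stand-ins: in the paper, predict() would query the LLM twice
# (k guesses, then verbalized confidences), and SYNONYMS would come from
# counter-fitted embeddings.
SYNONYMS = {
    "good": ["fine", "decent"],
    "great": ["nice", "okay"],
    "movie": ["film", "picture"],
}

def predict(words):
    # Toy calibrated classifier: confidence in "positive" falls as
    # sentiment-bearing words are swapped for neutral synonyms.
    strong = {"good", "great", "movie"}
    score = sum(w in strong for w in words) / len(words)
    label = "positive" if score >= 0.2 else "negative"
    return label, 0.5 + score / 2

def ce_attack(text, max_words=5, max_queries=50, seed=0):
    """Greedy confidence-minimizing word substitution (CEAttack-style sketch)."""
    words = text.split()
    label, best_conf = predict(words)           # original label and confidence p_y
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    random.Random(seed).shuffle(candidates)     # random perturbation order
    queries = 0
    for i in candidates[:max_words]:            # |W| capped at 5, as in the paper
        for syn in list(SYNONYMS.get(words[i], [])):
            if queries >= max_queries:          # query budget exhausted
                return " ".join(words), False
            trial = words[:i] + [syn] + words[i + 1:]
            new_label, conf = predict(trial)
            queries += 1
            if new_label != label:              # label flipped: attack succeeded
                return " ".join(trial), True
            if conf < best_conf:                # keep swaps that lower p_C (line 7)
                words, best_conf = trial, conf
    return " ".join(words), False
```

Because only a label and a coarse verbalized confidence are needed, this loop works in the hard-label black-box setting the paper targets; no logits or gradients are required.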