Measuring And Improving Persuasiveness Of Large Language Models

Authors: Somesh Singh, Yaman K Singla, Harini S I, Balaji Krishnamurthy

ICLR 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | With this motivation, we introduce Persuasion Bench and Persuasion Arena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the simulative and generative persuasion abilities of large language models. Our findings indicate that the simulative persuasion capabilities of LLMs are barely above random; however, their generative persuasion capabilities are much better. For instance, GPT-4o loses only 36% of the time when playing against the best human persuader. Further, we find that LLMs' persuasiveness correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models.
Researcher Affiliation | Industry | Somesh Singh, Yaman K Singla, Harini SI, Balaji Krishnamurthy, Adobe Media and Data Science Research (MDSR)
Pseudocode | Yes | Listing 3: Behavior Simulation. System prompt: You are an expert Twitter marketer responsible for evaluating your brand's tweets' quality and engagement potential. I am giving the following details to you: text content, attached media (if any), date and time when the tweet has to be posted, your brand name, and the username of the Twitter account (your brand might have multiple sub-brands). Analyze the tweet's relevance, creativity, clarity, originality, and brand tone and voice, all from the perspective of the tweet's potential for generating user interaction. Provide a concise assessment of the tweet's potential impact on the target audience.
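The behavior-simulation listing above feeds the model structured fields (text, media, posting time, brand, username). A minimal sketch of assembling such a prompt; the field layout and function name are illustrative guesses at Listing 3's inputs, not the paper's exact template:

```python
from datetime import datetime

# System prompt paraphrasing the role description from Listing 3.
SYSTEM_PROMPT = (
    "You are an expert Twitter marketer responsible for evaluating your "
    "brand's tweets' quality and engagement potential."
)

def behavior_simulation_prompt(text, brand, username, post_time, media=None):
    """Assemble the user message for the behavior-simulation task.

    All field labels here are hypothetical stand-ins for the paper's template.
    """
    lines = [
        f"Tweet text: {text}",
        f"Attached media: {media or 'none'}",
        f"Scheduled time: {post_time.isoformat()}",
        f"Brand: {brand}",
        f"Account username: {username}",
        "Predict the tweet's engagement (likes).",
    ]
    return "\n".join(lines)
```

The system prompt stays fixed across examples while the user message carries the per-tweet fields, matching the content/speaker/time conditioning described for the BS task.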
Open Source Code | Yes | We invite the community to explore and contribute to Persuasion Arena and Persuasion Bench, available at behavior-in-the-wild.github.io/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.
Open Datasets | Yes | With this motivation, we introduce Persuasion Bench and Persuasion Arena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the simulative and generative persuasion abilities of large language models. We invite the community to explore and contribute to Persuasion Arena and Persuasion Bench, available at behavior-in-the-wild.github.io/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.
Dataset Splits | Yes | The test set is constructed by holding out all samples from a number of randomly chosen accounts (company-stratified sampling; unknown sender, as per the communication framework) and all samples after a certain date (time-stratified sampling; unknown time). The test set contains 8k, 13k, and 9k pairs of tweets for the brand, time, and random splits, respectively. All test sets are balanced, and we use accuracy to report the results. To eliminate positional bias (Zheng et al., 2024) when determining which tweet in a pair performs better, we compute results on both orderings (T1, T2) and (T2, T1).
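Evaluating each pair in both orderings cancels out any preference the judge has for the first or second slot. A minimal sketch of this debiased pairwise accuracy; `judge_pair` is a hypothetical stand-in for the model call, not an API from the paper:

```python
def debiased_accuracy(pairs, judge_pair):
    """Score pairwise predictions on both orderings to cancel positional bias.

    pairs: iterable of (t1, t2, winner) where winner is "T1" or "T2".
    judge_pair(a, b): hypothetical comparator returning "first" or "second",
    naming which of its two arguments it predicts performs better.
    """
    correct = 0
    pairs = list(pairs)
    for t1, t2, winner in pairs:
        # Query in both orders and map each answer back to T1/T2.
        pred_fwd = "T1" if judge_pair(t1, t2) == "first" else "T2"
        pred_rev = "T2" if judge_pair(t2, t1) == "first" else "T1"
        # Each ordering counts as one trial, mirroring evaluation on
        # both (T1, T2) and (T2, T1).
        correct += (pred_fwd == winner) + (pred_rev == winner)
    return correct / (2 * len(pairs))
```

A position-biased judge that always answers "first" scores exactly 50% under this scheme on a balanced set, since its two predictions for each pair always disagree.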
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments.
Software Dependencies | Yes | We start with Vicuna-1.5 13B (Touvron et al., 2023; Chiang et al., 2023) and instruction fine-tune it with instructions created using 3 million unique tweets under the following settings.
Experiment Setup | Yes | We start with Vicuna-1.5 13B (Touvron et al., 2023; Chiang et al., 2023) and instruction fine-tune it with instructions created using 3 million unique tweets under the following settings:
1. We instruction fine-tune the Vicuna-1.5 13B model for content and behavior simulation tasks. In behavior simulation (BS) (Listing G), we teach the model to predict likes given content, speaker, and time; in content simulation (CS) (Listing G), we teach the model to generate the content given the required number of likes, speaker, and time.
2. We fine-tune the Vicuna-1.5 13B model for the tasks of content simulation (CS), behavior simulation (BS), and transsuasion (TS) (all types).
3. We developed a custom prompt (Listing 22) to instruct Vicuna-1.5 13B to generate the differences between tweet T2 (high likes) and T1 (low likes) for a given pair (T1, T2) and explain the potential reasons for T2's superior performance compared to T1. The generated explanation (I) was appended to 30,000 training samples, modifying the training data structure as follows: for generative transsuasion (TS-GT), (T1, I) as the input and T2 as the output; for comparative transsuasion (TS-CT), (T1, T2, I) as the input and T1 or T2 as the output. It is important to note that the explanation I is used only in the training samples and is not provided during testing.
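The explanation-augmented training data in setting 3 pairs each (T1, T2, I) triple with two task formats. A minimal sketch of assembling those samples, assuming T2 is always the higher-engagement tweet; the dict schema and function name are illustrative, not the paper's exact data format:

```python
def build_transsuasion_samples(triples):
    """Build instruction-tuning samples from (T1, T2, I) triples, where T1 is
    the low-likes tweet, T2 its high-likes counterpart, and I a generated
    explanation of why T2 outperformed T1 (used in training only).
    """
    samples = []
    for t1, t2, expl in triples:
        # Generative transsuasion (TS-GT): rewrite T1 into T2, guided by I.
        samples.append({"task": "TS-GT", "input": (t1, expl), "output": t2})
        # Comparative transsuasion (TS-CT): choose the better-performing
        # tweet of the pair, guided by I (here T2 by construction).
        samples.append({"task": "TS-CT", "input": (t1, t2, expl), "output": "T2"})
    return samples
```

At test time the same tasks would be posed without the explanation field, consistent with I being withheld during testing.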