Measuring And Improving Persuasiveness Of Large Language Models
Authors: Somesh Singh, Yaman Singla, Harini S I, Balaji Krishnamurthy
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With this motivation, we introduce Persuasion Bench and Persuasion Arena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the simulative and generative persuasion abilities of large language models. Our findings indicate that the simulative persuasion capabilities of LLMs are barely above random; however, their generative persuasion capabilities are much better. For instance, GPT-4o loses only 36% of the time when playing against the best human persuader. Further, we find that LLMs' persuasiveness correlates positively with model size, but smaller models can also be made more persuasive than much larger models. |
| Researcher Affiliation | Industry | Somesh Singh, Yaman K Singla, Harini SI, Balaji Krishnamurthy; Adobe Media and Data Science Research (MDSR) |
| Pseudocode | Yes | Listing 3: Behavior Simulation System prompt: You are an expert Twitter marketer responsible for evaluating your brand's tweets' quality and engagement potential. I am giving the following details to you: text content, attached media (if any), date and time when the tweet has to be posted, your brand name, and the username of the Twitter account (your brand might have multiple sub-brands). Analyze the tweet's relevance, creativity, clarity, originality, brand tone and voice, all from the perspective of the tweet's potential for generating user interaction. Provide a concise assessment of the tweet's potential impact on the target audience. |
| Open Source Code | Yes | We invite the community to explore and contribute to Persuasion Arena and Persuasion Bench, available at behavior-in-the-wild.github.io/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications. |
| Open Datasets | Yes | With this motivation, we introduce Persuasion Bench and Persuasion Arena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the simulative and generative persuasion abilities of large language models. We invite the community to explore and contribute to Persuasion Arena and Persuasion Bench, available at behavior-in-the-wild.github.io/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications. |
| Dataset Splits | Yes | The test set is constructed by holding out all samples from a number of randomly chosen accounts (company-stratified sampling; unknown sender as per the communication framework) and all samples after a certain date (time-stratified sampling; unknown time). The test sets contain 8k, 13k, and 9k pairs of tweets for the brand, time, and random splits, respectively. All test sets are balanced, and we use accuracy to report the results. To eliminate positional bias (Zheng et al., 2024) when determining which tweet in a pair performs better, we compute results on both orderings (T1, T2) and (T2, T1). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. |
| Software Dependencies | Yes | We start with Vicuna-1.5 13B (Touvron et al., 2023; Chiang et al., 2023) and instruction fine-tune it with instructions created using 3 million unique tweets under the following settings |
| Experiment Setup | Yes | We start with Vicuna-1.5 13B (Touvron et al., 2023; Chiang et al., 2023) and instruction fine-tune it with instructions created using 3 million unique tweets under the following settings: 1. We instruction fine-tune the Vicuna-1.5 13B model for content and behavior simulation tasks. In behavior simulation (BS) (Listing G), we teach a model to predict likes given content, speaker, and time; in content simulation (CS) (Listing G), we teach the model to generate the content given the required number of likes, speaker, and time. 2. We fine-tune the Vicuna-1.5 13B model for the tasks of content simulation (CS), behavior simulation (BS), and transsuasion (TS) (all types). 3. We developed a custom prompt (Listing 22) to instruct Vicuna-1.5 13B to generate differences between tweet T2 (high likes) and T1 (low likes) for a given pair (T1, T2) and explain the potential reasons for T2's superior performance compared to T1. The generated explanation (I) was appended to 30,000 training samples, modifying the training data structure as follows: for generative transsuasion (TS-GT), (T1, I) as input and T2 as output; for comparative transsuasion (TS-CT), (T1, T2, I) as input and T1 or T2 as output. Note that the explanation I is used only in the training samples and is not provided during testing. |
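The positional-bias control quoted in the Dataset Splits row (scoring each pair in both (T1, T2) and (T2, T1) order) can be sketched as below. The function and the toy judge are illustrative assumptions, not the paper's actual evaluation code.

```python
def debiased_pair_accuracy(pairs, judge):
    """Evaluate a pairwise judge on both orderings of each tweet pair.

    `pairs` is a list of (t1, t2, label) tuples, where label is 0 if t1
    performed better and 1 if t2 did. `judge(a, b)` returns 0 if it
    predicts `a` wins and 1 if it predicts `b` wins. Scoring each pair
    in both (t1, t2) and (t2, t1) order cancels any preference the
    judge has for the first or second position.
    """
    correct = total = 0
    for t1, t2, label in pairs:
        # Forward order: the prediction should match the label directly.
        correct += judge(t1, t2) == label
        # Reversed order: the correct answer flips with the ordering.
        correct += judge(t2, t1) == (1 - label)
        total += 2
    return correct / total

# Toy judge that always prefers the longer tweet (illustration only).
longer_wins = lambda a, b: 0 if len(a) >= len(b) else 1
pairs = [("short", "a much longer tweet", 1), ("verbose tweet here", "hi", 0)]
print(debiased_pair_accuracy(pairs, longer_wins))  # 1.0 on this toy data
```

A judge that blindly favored position 1 would score exactly 0.5 under this scheme, which is why both orderings are needed.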
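The training-data layout described in setup step 3 (explanation I appended at training time only) can be sketched as follows; the function name and dict schema are hypothetical, chosen to mirror the (input, output) structure quoted above.

```python
def build_transsuasion_samples(pairs_with_expl):
    """Assemble training samples for the two transsuasion variants.

    `pairs_with_expl` holds (t1, t2, expl) triples, where t2 is the
    higher-engagement tweet and expl is the generated explanation of
    why t2 outperformed t1 (available at training time only; it is
    not supplied during testing).
    """
    samples = []
    for t1, t2, expl in pairs_with_expl:
        # Generative transsuasion (TS-GT): (T1, I) as input, T2 as output,
        # i.e. rewrite the weaker tweet guided by the explanation.
        samples.append({"task": "TS-GT", "input": (t1, expl), "output": t2})
        # Comparative transsuasion (TS-CT): (T1, T2, I) as input, with the
        # better-performing tweet (here t2 by construction) as the output.
        samples.append({"task": "TS-CT", "input": (t1, t2, expl), "output": t2})
    return samples
```

At test time the same tasks would be posed without the explanation field, matching the paper's note that I appears only in the 30,000 augmented training samples.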