Measuring And Improving Persuasiveness Of Large Language Models

Authors: Somesh Singh, Yaman K Singla, Harini S I, Balaji Krishnamurthy

ICLR 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | With this motivation, we introduce Persuasion Bench and Persuasion Arena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the simulative and generative persuasion abilities of large language models. Our findings indicate that the simulative persuasion capabilities of LLMs are barely above random; however, their generative persuasion capabilities are much better. For instance, GPT-4o loses only 36% of the time when playing against the best human persuader. Further, we find that LLMs' persuasiveness correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models.
Researcher Affiliation | Industry | Somesh Singh, Yaman K Singla, Harini SI, Balaji Krishnamurthy, Adobe Media and Data Science Research (MDSR)
Pseudocode | Yes | Listing 3: Behavior Simulation. System prompt: You are an expert Twitter marketer responsible for evaluating your brand's tweets' quality and engagement potential. I am giving the following details to you: text content, attached media (if any), date and time when the tweet has to be posted, your brand name, and the username of the Twitter account (your brand might have multiple sub-brands). Analyze the tweet's relevance, creativity, clarity, originality, and brand tone and voice, all from the perspective of the tweet's potential for generating user interaction. Provide a concise assessment of the tweet's potential impact on the target audience.
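The behavior-simulation listing above feeds the model structured fields (text, media, posting time, brand, username). A minimal sketch of assembling such a prompt; the field layout and function name are illustrative guesses at Listing 3's inputs, not the paper's exact template:

```python
from datetime import datetime

# System prompt paraphrasing the role description from Listing 3.
SYSTEM_PROMPT = (
    "You are an expert Twitter marketer responsible for evaluating your "
    "brand's tweets' quality and engagement potential."
)

def behavior_simulation_prompt(text, brand, username, post_time, media=None):
    """Assemble the user message for the behavior-simulation task.

    All field labels here are hypothetical stand-ins for the paper's template.
    """
    lines = [
        f"Tweet text: {text}",
        f"Attached media: {media or 'none'}",
        f"Scheduled time: {post_time.isoformat()}",
        f"Brand: {brand}",
        f"Account username: {username}",
        "Predict the tweet's engagement (likes).",
    ]
    return "\n".join(lines)
```

The system prompt stays fixed across examples while the user message carries the per-tweet fields, matching the content/speaker/time conditioning described for the BS task.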
Open Source Code | Yes | We invite the community to explore and contribute to Persuasion Arena and Persuasion Bench, available at behavior-in-the-wild.github.io/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.
Open Datasets | Yes | With this motivation, we introduce Persuasion Bench and Persuasion Arena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the simulative and generative persuasion abilities of large language models. We invite the community to explore and contribute to Persuasion Arena and Persuasion Bench, available at behavior-in-the-wild.github.io/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.
Dataset Splits | Yes | The test set is constructed by holding out all samples from a number of randomly chosen accounts (company-stratified sampling; unknown sender, as per the communication framework) and all samples after a certain date (time-stratified sampling; unknown time). The test set contains 8k, 13k, and 9k pairs of tweets for the brand, time, and random splits, respectively. All test sets are balanced, and we use accuracy to report the results. To eliminate positional bias (Zheng et al., 2024) when determining which tweet in a pair performs better, we compute results on both orderings (T1, T2) and (T2, T1).
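Evaluating each pair in both orderings cancels out any preference the judge has for the first or second slot. A minimal sketch of this debiased pairwise accuracy; `judge_pair` is a hypothetical stand-in for the model call, not an API from the paper:

```python
def debiased_accuracy(pairs, judge_pair):
    """Score pairwise predictions on both orderings to cancel positional bias.

    pairs: iterable of (t1, t2, winner) where winner is "T1" or "T2".
    judge_pair(a, b): hypothetical comparator returning "first" or "second",
    naming which of its two arguments it predicts performs better.
    """
    correct = 0
    pairs = list(pairs)
    for t1, t2, winner in pairs:
        # Query in both orders and map each answer back to T1/T2.
        pred_fwd = "T1" if judge_pair(t1, t2) == "first" else "T2"
        pred_rev = "T2" if judge_pair(t2, t1) == "first" else "T1"
        # Each ordering counts as one trial, mirroring evaluation on
        # both (T1, T2) and (T2, T1).
        correct += (pred_fwd == winner) + (pred_rev == winner)
    return correct / (2 * len(pairs))
```

A position-biased judge that always answers "first" scores exactly 50% under this scheme on a balanced set, since its two predictions for each pair always disagree.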
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments.
Software Dependencies | Yes | We start with Vicuna-1.5 13B (Touvron et al., 2023; Chiang et al., 2023) and instruction fine-tune it with instructions created using 3 million unique tweets under the following settings.
Experiment Setup | Yes | We start with Vicuna-1.5 13B (Touvron et al., 2023; Chiang et al., 2023) and instruction fine-tune it with instructions created using 3 million unique tweets under the following settings:
1. We instruction fine-tune the Vicuna-1.5 13B model for content and behavior simulation tasks. In behavior simulation (BS) (Listing G), we teach the model to predict likes given content, speaker, and time; in content simulation (CS) (Listing G), we teach the model to generate the content given the required number of likes, speaker, and time.
2. We fine-tune the Vicuna-1.5 13B model for the tasks of content simulation (CS), behavior simulation (BS), and transsuasion (TS) (all types).
3. We developed a custom prompt (Listing 22) to instruct Vicuna-1.5 13B to generate the differences between tweet T2 (high likes) and T1 (low likes) for a given pair (T1, T2) and explain the potential reasons for T2's superior performance compared to T1. The generated explanation (I) was appended to 30,000 training samples, modifying the training data structure as follows: for generative transsuasion (TS-GT), (T1, I) as the input and T2 as the output; for comparative transsuasion (TS-CT), (T1, T2, I) as the input and T1 or T2 as the output. It is important to note that the explanation I is used only in the training samples and is not provided during testing.
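The explanation-augmented training data in setting 3 pairs each (T1, T2, I) triple with two task formats. A minimal sketch of assembling those samples, assuming T2 is always the higher-engagement tweet; the dict schema and function name are illustrative, not the paper's exact data format:

```python
def build_transsuasion_samples(triples):
    """Build instruction-tuning samples from (T1, T2, I) triples, where T1 is
    the low-likes tweet, T2 its high-likes counterpart, and I a generated
    explanation of why T2 outperformed T1 (used in training only).
    """
    samples = []
    for t1, t2, expl in triples:
        # Generative transsuasion (TS-GT): rewrite T1 into T2, guided by I.
        samples.append({"task": "TS-GT", "input": (t1, expl), "output": t2})
        # Comparative transsuasion (TS-CT): choose the better-performing
        # tweet of the pair, guided by I (here T2 by construction).
        samples.append({"task": "TS-CT", "input": (t1, t2, expl), "output": "T2"})
    return samples
```

At test time the same tasks would be posed without the explanation field, consistent with I being withheld during testing.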