BotSim: LLM-Powered Malicious Social Botnet Simulation

Authors: Boyu Qiao, Kun Li, Wei Zhou, Shilong Li, Qianqian Lu, Songlin Hu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The experimental results indicate that detection methods effective on traditional bot datasets perform worse on BotSim-24, highlighting the urgent need for new detection strategies to address the cybersecurity threats posed by these advanced bots.
Researcher Affiliation Academia 1Institute of Information Engineering, Chinese Academy of Sciences; 2School of Cyber Security, University of Chinese Academy of Sciences
Pseudocode Yes A complete prompt example is provided in Appendix B.4, and the algorithm for this execution process is further explained in Appendix B.5.
Open Source Code Yes Code: https://github.com/QQQQQQBY/BotSim
Open Datasets No BotSim-24: LLM-driven Bot Detection Dataset In this section, we present BotSim-24, a bot detection dataset powered by LLMs. Building on the BotSim framework, we simulate information dissemination and user interactions across six subreddits on Reddit. This process results in the creation of the BotSim-24 dataset, which includes 1,907 human accounts and 1,000 LLM-driven agent bot accounts. [...] The BotSim-24 dataset does not include interactions between humans and bots. Statistics in Table 4 show that such interactions are also relatively sparse in actual OSNs. However, as LLM-powered bots become more prevalent, their highly human-like characteristics will inevitably lead to an increase in human-bot interactions. As demonstrated by our edge perturbation experiments, this trend will challenge and undermine the effectiveness of GNN-based methods. Furthermore, Table 5 offers a detailed overview of the performance of various LLMs on account detection tasks based on textual content. Additionally, Figure 4 in Appendix A.5 visually illustrates findings on the accuracy of human annotators. These results highlight the difficulty LLMs face in distinguishing between text they generate and text authored by humans; human annotators also struggle to achieve high accuracy in this regard. For additional details, please refer to Appendix A.5. This underscores the critical challenge of detecting LLM-driven bots and emphasizes the urgent need for innovative detection strategies to keep pace with their evolving capabilities.
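The edge-perturbation idea quoted above (injecting human-bot links to stress GNN-based detectors) can be sketched as follows. This is a minimal illustration, not the paper's exact protocol: the function name, edge representation, and perturbation count are assumptions.

```python
import random

def perturb_edges(edges, human_ids, bot_ids, n_new, seed=0):
    """Inject synthetic human-bot edges into an interaction graph.

    edges: set of (u, v) node-id pairs; human_ids / bot_ids: node id lists.
    Returns a new edge set with n_new extra human-bot links, mimicking the
    increased human-bot interaction that LLM-driven bots could cause.
    """
    rng = random.Random(seed)
    perturbed = set(edges)
    while len(perturbed) < len(edges) + n_new:
        perturbed.add((rng.choice(human_ids), rng.choice(bot_ids)))
    return perturbed

# Toy graph: nodes 0-2 are humans, 3-4 are bots; no cross-group edges yet.
edges = {(0, 1), (3, 4)}
new_edges = perturb_edges(edges, human_ids=[0, 1, 2], bot_ids=[3, 4], n_new=3)
```

A GNN-based detector trained on the unperturbed graph would then be evaluated on the perturbed one to measure how much added human-bot edges degrade its accuracy.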
Dataset Splits Yes Consistent with the division used in TwiBot-20 and MGTAB-22, we randomly divide all datasets into training, validation, and test sets with a ratio of 7:2:1. Table 2 shows the division of the BotSim-24 dataset.
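The 7:2:1 random split described in the response can be sketched as below; the function name and fixed seed are illustrative assumptions, and the account total follows from the 1,907 human plus 1,000 bot accounts reported for BotSim-24.

```python
import random

def split_indices(n, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle n account indices and split them into train/val/test
    at the 7:2:1 ratio used for BotSim-24 (as in TwiBot-20 and MGTAB)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# BotSim-24: 1,907 human + 1,000 bot accounts = 2,907 total.
train, val, test = split_indices(2907)
```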
Hardware Specification Yes Our experiments are conducted on four Tesla V100 GPUs with 32GB of memory.
Software Dependencies No Detailed hyperparameter settings can be found in Appendix A.1.
Experiment Setup No Detailed hyperparameter settings can be found in Appendix A.1.