The Illusion of Empathy: How AI Chatbots Shape Conversation Perception

Authors: Tingting Liu, Salvatore Giorgi, Ankit Aich, Allison Lahnala, Brenda Curtis, Lyle Ungar, João Sedoc

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Analyzing 155 conversations from two datasets, we found that while GPT-based chatbots were rated significantly higher in conversational quality, they were consistently perceived as less empathetic than human conversational partners. We report four main experiments below to explore the relationship between perceived empathy, chatbot identity, and language use, and their impact on conversational quality with chatbots and humans: analyzing psychological ratings, using LLM annotations, developing a perceived-empathy model, and evaluating pre-trained empathy models.
Researcher Affiliation | Academia | National Institute on Drug Abuse; University of Pennsylvania; New York University
Pseudocode | No | The paper describes methodologies and experiments but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and additional experiments were provided in supplements: https://github.com/hellotingting/BotvsHumanEmpathy.git
Open Datasets | Yes | In this paper, we combine the following three datasets: the Empathic Conversations dataset (EC; Omitaomu et al. 2022), the WASSA 2023 shared task dataset (Barriere et al. 2023), and the WASSA 2024 shared task dataset (Giorgi et al. 2024).
Dataset Splits | Yes | Using 10-fold cross-validation with an ℓ2-penalized ridge regression (regularization term λ chosen as 10,000 using nested cross-validation), we obtained a prediction accuracy of Pearson r = 0.17.
Hardware Specification | No | The paper mentions the use of GPT-3.5-turbo and GPT-4-0125-preview models and R for analysis, but does not specify any hardware details (e.g., GPU or CPU models, or cloud computing instances) used for running the experiments or training their models.
Software Dependencies | No | The paper mentions using R with the lmer() function (lme4 package) for statistical analysis, and the DLATK Python package for a perceived-empathy model. However, specific version numbers for these software dependencies are not provided.
Experiment Setup | Yes | Using 10-fold cross-validation with an ℓ2-penalized ridge regression (regularization term λ chosen as 10,000 using nested cross-validation), we obtained a prediction accuracy of Pearson r = 0.17.
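The evaluation setup quoted above (10-fold cross-validated ridge regression with λ = 10,000, scored by Pearson r) can be sketched as follows. This is a minimal illustration, not the authors' code: the feature matrix and empathy ratings here are synthetic stand-ins, and the paper additionally tuned λ via nested cross-validation, which is omitted for brevity.

```python
# Sketch of 10-fold CV ridge regression scored by out-of-fold Pearson r.
# Synthetic data stands in for the paper's language features and ratings.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(155, 50))               # hypothetical language features per conversation
y = 0.3 * X[:, 0] + rng.normal(size=155)     # synthetic perceived-empathy ratings

preds = np.empty_like(y)
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=10_000)              # fixed l2 penalty, lambda = 10,000 as reported
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

r = np.corrcoef(y, preds)[0, 1]              # out-of-fold Pearson correlation
print(round(r, 3))
```

Pooling out-of-fold predictions before computing a single Pearson r (rather than averaging per-fold correlations) is one common convention; the paper does not specify which variant was used.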