The Illusion of Empathy: How AI Chatbots Shape Conversation Perception
Authors: Tingting Liu, Salvatore Giorgi, Ankit Aich, Allison Lahnala, Brenda Curtis, Lyle Ungar, João Sedoc
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Analyzing 155 conversations from two datasets, we found that while GPT-based chatbots were rated significantly higher in conversational quality, they were consistently perceived as less empathetic than human conversational partners. We report four main experiments exploring the relationship between perceived empathy, chatbot identity, and language use, and their impact on conversational quality in conversations with chatbots and humans: analyzing psychological ratings, using LLM annotations, developing a perceived-empathy model, and evaluating pre-trained empathy models. |
| Researcher Affiliation | Academia | (1) National Institute on Drug Abuse; (2) University of Pennsylvania; (3) New York University |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and additional experiments were provided in the supplement: https://github.com/hellotingting/BotvsHumanEmpathy.git |
| Open Datasets | Yes | In this paper, we combine the following three datasets: the Empathic Conversations dataset (EC; Omitaomu et al. 2022); the WASSA 2023 shared task dataset (Barriere et al. 2023); and the WASSA 2024 shared task dataset (Giorgi et al. 2024). |
| Dataset Splits | Yes | Using 10-fold cross-validation with an l2-penalized ridge regression (regularization term λ = 10,000, chosen via nested cross-validation), we obtained a prediction accuracy of Pearson r = 0.17. |
| Hardware Specification | No | The paper mentions the use of GPT-3.5-turbo and GPT-4-0125-preview models and R for analysis, but does not specify any hardware details (e.g., GPU, CPU models, or cloud computing instances) used for running the experiments or training their models. |
| Software Dependencies | No | The paper mentions using R and the lmer() package for statistical analysis, and the DLATK Python package for a perceived empathy model. However, specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | Using 10-fold cross-validation with an l2-penalized ridge regression (regularization term λ = 10,000, chosen via nested cross-validation), we obtained a prediction accuracy of Pearson r = 0.17. |
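The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows can be sketched as follows. This is a minimal illustration, not the authors' code: the features and ratings here are synthetic stand-ins (the paper's model uses DLATK-extracted language features), only the row count (155 conversations), the 10-fold cross-validation, and the ridge penalty of 10,000 are taken from the report, so the resulting correlation will not match the paper's r = 0.17.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-ins: 155 conversations (as in the paper) with an
# invented feature count and simulated perceived-empathy ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(155, 50))
y = 0.5 * X[:, 0] + rng.normal(size=155)

# Out-of-fold predictions via 10-fold cross-validation.
preds = np.zeros_like(y)
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    # l2-penalized ridge regression; alpha plays the role of the
    # regularization term lambda = 10,000 reported in the paper.
    model = Ridge(alpha=10_000)
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

# Accuracy is reported as the Pearson correlation between held-out
# predictions and the true ratings.
r, _ = pearsonr(y, preds)
print(f"Pearson r = {r:.2f}")
```

In the paper, λ itself is selected by nested cross-validation (an inner CV loop inside each training fold); the sketch above fixes it at the reported value for brevity.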