The Illusion of Empathy: How AI Chatbots Shape Conversation Perception
Authors: Tingting Liu, Salvatore Giorgi, Ankit Aich, Allison Lahnala, Brenda Curtis, Lyle Ungar, João Sedoc
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Analyzing 155 conversations from two datasets, we found that while GPT-based chatbots were rated significantly higher in conversational quality, they were consistently perceived as less empathetic than human conversational partners. We report four main experiments exploring the relationship between perceived empathy, chatbot identity, and language use, and their impact on conversational quality in conversations with chatbots and humans: analyzing psychological ratings, using LLM annotations, developing a perceived-empathy model, and evaluating pre-trained empathy models. |
| Researcher Affiliation | Academia | (1) National Institute on Drug Abuse; (2) University of Pennsylvania; (3) New York University |
| Pseudocode | No | The paper describes methodologies and experiments but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and additional experiments were provided in the supplement: https://github.com/hellotingting/BotvsHumanEmpathy.git |
| Open Datasets | Yes | In this paper, we combine the following three datasets: the Empathic Conversations dataset (EC; Omitaomu et al. 2022); the WASSA 2023 shared task dataset (Barriere et al. 2023); and the WASSA 2024 shared task dataset (Giorgi et al. 2024). |
| Dataset Splits | Yes | Using 10-fold cross-validation with an l2-penalized ridge regression (regularization term λ = 10,000, chosen via nested cross-validation), we obtained a prediction accuracy of Pearson r = 0.17. |
| Hardware Specification | No | The paper mentions the use of GPT-3.5-turbo and GPT-4-0125-preview models and R for analysis, but does not specify any hardware details (e.g., GPU, CPU models, or cloud computing instances) used for running the experiments or training their models. |
| Software Dependencies | No | The paper mentions using R and the lmer() package for statistical analysis, and the DLATK Python package for a perceived empathy model. However, specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | Using 10-fold cross-validation with an l2-penalized ridge regression (regularization term λ = 10,000, chosen via nested cross-validation), we obtained a prediction accuracy of Pearson r = 0.17. |
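The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows can be sketched as follows. This is a minimal illustration, not the authors' code: the features and ratings here are synthetic stand-ins (the paper's model uses DLATK-extracted language features), only the row count (155 conversations), the 10-fold cross-validation, and the ridge penalty of 10,000 are taken from the report, so the resulting correlation will not match the paper's r = 0.17.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-ins: 155 conversations (as in the paper) with an
# invented feature count and simulated perceived-empathy ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(155, 50))
y = 0.5 * X[:, 0] + rng.normal(size=155)

# Out-of-fold predictions via 10-fold cross-validation.
preds = np.zeros_like(y)
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    # l2-penalized ridge regression; alpha plays the role of the
    # regularization term lambda = 10,000 reported in the paper.
    model = Ridge(alpha=10_000)
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

# Accuracy is reported as the Pearson correlation between held-out
# predictions and the true ratings.
r, _ = pearsonr(y, preds)
print(f"Pearson r = {r:.2f}")
```

In the paper, λ itself is selected by nested cross-validation (an inner CV loop inside each training fold); the sketch above fixes it at the reported value for brevity.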