StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?
Authors: Guobin Shen, Dongcheng Zhao, Aorigele Bao, Xiang He, Yiting Dong, Yi Zeng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low- and high-stress conditions. |
| Researcher Affiliation | Academia | 1 BrainCog Lab, Institute of Automation, Chinese Academy of Sciences, 2 Beijing Institute of AI Safety and Governance, 3 Beijing Key Laboratory of AI Safety and Superalignment, 4 Center for Long-term Artificial Intelligence, 5 School of Future Technology, University of Chinese Academy of Sciences, Corresponding author (EMAIL) |
| Pseudocode | No | The paper describes methodologies such as 'Stress Prompt Construction', 'Stress Prompt Evaluation', and 'Stress Prompt Analysis' using descriptive text and mathematical formulas (Eq. 1, 2, 3, 4), but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "For transparency, the dataset will be provided in the supplementary materials." However, it does not provide an explicit statement or link for the source code for the methodology described in the paper. It mentions that "The evaluations were conducted using the lm eval (Gao et al. 2023) framework with default settings," which is a third-party tool. |
| Open Datasets | Yes | We developed an innovative dataset, StressPrompt, consisting of meticulously crafted prompts designed to induce varying levels of stress, grounded in established psychological frameworks. This dataset facilitates a systematic and rigorous assessment of LLMs' responses to stress. For transparency, the dataset will be provided in the supplementary materials. |
| Dataset Splits | No | The paper describes the construction of the 'Stress Prompt' dataset and its annotation process, classifying prompts into stress levels. It also mentions using various benchmark datasets (IFEval, BBH, MATH, etc.) for evaluation with the 'lm eval' framework. However, it does not explicitly provide train/test/validation split percentages, sample counts, or a splitting methodology for either the StressPrompt dataset or the benchmark datasets. It states only that 'lm eval' was used with default settings; that framework typically handles splits internally, but the splits themselves are not described in the paper. |
| Hardware Specification | Yes | All evaluations were performed on NVIDIA A100 GPUs. |
| Software Dependencies | No | The evaluations were conducted using the lm eval (Gao et al. 2023) framework with default settings. The paper does not specify version numbers for 'lm eval' or any other software dependencies, such as programming languages, deep learning frameworks, or libraries. |
| Experiment Setup | Yes | The generation temperature was set to 0, and specific dialogue tokens were used to ensure consistency. We utilized a range of benchmarks that assessed emotional intelligence, bias detection, instruction following, reasoning, and mathematical problem-solving... Baseline prompts used for comparison were "You are a helpful assistant" and "Let's think step by step." |
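As a concrete illustration of the setup recorded in the table above (lm-eval framework with default settings, temperature 0, NVIDIA A100 GPUs), an evaluation run might look like the sketch below. The model checkpoint and task list are illustrative assumptions, not the paper's exact configuration:

```shell
# Hypothetical lm-evaluation-harness (Gao et al. 2023) invocation;
# the pretrained model and task names are placeholders, not taken from the paper.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
    --tasks ifeval,bbh \
    --gen_kwargs temperature=0 \
    --device cuda:0 \
    --batch_size auto
```

A StressPrompt-level system prompt would then be prepended to each task input (e.g. via a chat template), which is the step the paper varies across stress conditions.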