OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?
Authors: Junjielong Xu, Qinan Zhang, Zhiqing Zhong, Shilin He, Chaoyun Zhang, Qingwei Lin, Dan Pei, Pinjia He, Dongmei Zhang, Qi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show substantial room for improvement, as current models can only handle the simplest cases. Even with the specially designed RCA-agent, the best-performing model, Claude 3.5, solved only 11.34% of failure cases. Our work paves the way for future research in this direction. |
| Researcher Affiliation | Collaboration | (1) School of Data Science, The Chinese University of Hong Kong, Shenzhen; (2) Microsoft; (3) Tsinghua University |
| Pseudocode | No | The paper describes the workflow of RCA-agent and its components (Controller and Executor) using natural language and summarized reasoning chains (Figure 7). Appendix B.2 provides agent prompts, which are instructions given to the LLM, but no structured pseudocode or algorithm blocks are explicitly presented. |
| Open Source Code | Yes | The OpenRCA code and data are available at GitHub. |
| Open Datasets | Yes | To answer this question, we propose OpenRCA, a public benchmark dataset and evaluation framework for assessing LLMs' root cause analysis ability in a practical software operating scenario. OpenRCA consists of 335 failure cases collected from three heterogeneous software systems deployed in the real world, accompanied by over 68 GB of de-identified telemetry data. All telemetry data originates from the AIOps Challenge series (AIOps, 2018), which is open-sourced and licensed under Creative Commons Attribution-NonCommercial 4.0 International: https://creativecommons.org/licenses/by-nc/4.0/. |
| Dataset Splits | No | The paper describes the construction of the OpenRCA dataset, including system selection, data balancing (downsampling), data calibration, and query synthesis. It evaluates models on this benchmark of 335 failure cases but does not provide explicit training, validation, or test splits for models to be trained or fine-tuned on; rather, it assesses their performance on the entire benchmark. For example, it mentions "The overall accuracy is the average score across all failure cases." This suggests that the benchmark is used as a single evaluation set, without further splits for development or testing phases. |
| Hardware Specification | No | The paper states that models were "accessed via APIs" and details the context window sizes (e.g., "at least 128K token context windows"). However, it does not provide specific hardware details such as GPU models, CPU models, or memory specifications used to run the experiments or access the APIs. |
| Software Dependencies | Yes | To solve OpenRCA tasks, which require processing long contexts, we selected six models with at least 128K token context windows, including three proprietary models: Claude 3.5, GPT-4o, and Gemini 1.5 Pro; and three open-source models: Mistral Large 2, Command R+, and Llama 3.1 Instruct. The model checkpoints are shown in Appendix C.1, Table 7: Claude 3.5 = claude-3-5-sonnet-20240620; GPT-4o = gpt-4o-20240513; Gemini 1.5 Pro = gemini-1.5-pro-exp-0801; Mistral Large 2 = mistral-large-instruct-2407; Command R+ = command-r-plus-08-2024; Llama 3.1 Instruct = meta-llama-70B-instruct. |
| Experiment Setup | Yes | In this section, we describe the experimental setup used to evaluate LLMs on OpenRCA problems. 4.1 SAMPLING-BASED METHODS Given the vast volume of telemetry data, it is impractical to feed all telemetry into the LLMs due to their limited context window. A common strategy to reduce telemetry volume in RCA is sampling (Huang et al., 2024; He et al., 2023). Thus, we downsample all telemetry data (including trace, log, and metric) to a frequency of one minute by selecting the first recorded value within each minute, regardless of the original frequency. Meanwhile, we relax the accuracy criterion for predicting failure start time to within one minute of the actual event. However, this sampling is still insufficient, as the metrics consist of a large number of KPI types (e.g., memory usage, network delay), necessitating further sampling of KPI types. Thus, we consider two sampling strategies: Oracle Sampling: To investigate the upper bound of the sampling-based method's performance, we introduce oracle sampling. During benchmark construction, engineers identified a fixed set of golden KPIs that are helpful for identifying the root cause. In oracle sampling, we filter these golden KPIs as our target. Balanced Sampling: We use stratified sampling by iteratively selecting one random but unique KPI type from each metric file until the number of sampled KPIs matches that in the Oracle setting. 4.2 LANGUAGE MODELS To solve OpenRCA tasks, which require processing long contexts, we selected six models with at least 128K token context windows, including three proprietary models: Claude 3.5, GPT-4o, and Gemini 1.5 Pro; and three open-source models: Mistral Large 2, Command R+, and Llama 3.1 Instruct. |
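The two sampling steps quoted in the Experiment Setup row (per-minute downsampling by first value, and balanced stratified selection of KPI types) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record schema (`(timestamp_seconds, value)` pairs), the `metric_files` mapping, and the helper names `downsample_per_minute` and `balanced_sample` are all assumptions made for this sketch.

```python
import random


def downsample_per_minute(records):
    """Keep only the first recorded value within each minute.

    `records` is an iterable of (timestamp_seconds, value) pairs,
    assumed sorted by timestamp (hypothetical schema, not the
    paper's actual telemetry format).
    """
    seen_minutes = set()
    kept = []
    for ts, value in records:
        minute = ts // 60
        if minute not in seen_minutes:
            seen_minutes.add(minute)
            kept.append((ts, value))
    return kept


def balanced_sample(metric_files, target_count, seed=0):
    """Stratified sampling: repeatedly draw one random, not-yet-chosen
    KPI type from each metric file until `target_count` KPIs are drawn
    (mirroring the paper's "one random but unique KPI type per file"
    description; `metric_files` maps file name -> list of KPI types).
    """
    rng = random.Random(seed)
    chosen = []
    remaining = {name: list(kpis) for name, kpis in metric_files.items()}
    while len(chosen) < target_count and any(remaining.values()):
        for name, kpis in remaining.items():
            if kpis and len(chosen) < target_count:
                pick = rng.choice(kpis)
                kpis.remove(pick)
                chosen.append((name, pick))
    return chosen
```

In the oracle setting, `target_count` would be the size of the engineer-identified golden-KPI set, so both strategies feed the model the same number of KPIs and differ only in which ones are selected.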