OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?
Authors: Junjielong Xu, Qinan Zhang, Zhiqing Zhong, Shilin He, Chaoyun Zhang, Qingwei Lin, Dan Pei, Pinjia He, Dongmei Zhang, Qi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show substantial room for improvement, as current models can only handle the simplest cases. Even with the specially designed RCA-agent, the best-performing model, Claude 3.5, solved only 11.34% of failure cases. Our work paves the way for future research in this direction. |
| Researcher Affiliation | Collaboration | (1) School of Data Science, The Chinese University of Hong Kong, Shenzhen; (2) Microsoft; (3) Tsinghua University |
| Pseudocode | No | The paper describes the workflow of RCA-agent and its components (Controller and Executor) using natural language and summarized reasoning chains (Figure 7). Appendix B.2 provides agent prompts, which are instructions given to the LLM, but no structured pseudocode or algorithm blocks are explicitly presented. |
| Open Source Code | Yes | The OpenRCA code and data are available at GitHub. |
| Open Datasets | Yes | To answer this question, we propose OpenRCA, a public benchmark dataset and evaluation framework for assessing LLMs' root cause analysis ability in a practical software operating scenario. OpenRCA consists of 335 failure cases collected from three heterogeneous software systems deployed in the real world, accompanied by over 68 GB of de-identified telemetry data. All telemetry data originates from the AIOps Challenge series (AIOps, 2018), which is open-sourced and licensed under Creative Commons Attribution-NonCommercial 4.0 International: https://creativecommons.org/licenses/by-nc/4.0/. |
| Dataset Splits | No | The paper describes the construction of the OpenRCA dataset, including system selection, data balancing (downsampling), data calibration, and query synthesis. It evaluates models on this benchmark of 335 failure cases but does not provide explicit training, validation, or test splits for models to be trained or fine-tuned on; rather, it assesses their performance on the entire benchmark. For example, it mentions "The overall accuracy is the average score across all failure cases." This suggests that the benchmark is used as a single evaluation set, without further splits for development or testing phases. |
| Hardware Specification | No | The paper states that models were "accessed via APIs" and details the context window sizes (e.g., "at least 128K token context windows"). However, it does not provide specific hardware details such as GPU models, CPU models, or memory specifications used to run the experiments or access the APIs. |
| Software Dependencies | Yes | To solve OpenRCA tasks, which require processing long contexts, we selected six models with at least 128K token context windows, including three proprietary models: Claude 3.5, GPT-4o, and Gemini 1.5 Pro; and three open-source models: Mistral Large 2, Command R+, and Llama 3.1 Instruct. The model checkpoints are shown in Appendix C.1, Table 7: Claude 3.5 = claude-3-5-sonnet-20240620; GPT-4o = gpt-4o-20240513; Gemini 1.5 Pro = gemini-1.5-pro-exp-0801; Mistral Large 2 = mistral-large-instruct-2407; Command R+ = command-r-plus-08-2024; Llama 3.1 Instruct = meta-llama-70B-instruct. |
| Experiment Setup | Yes | In this section, we describe the experimental setup used to evaluate LLMs on OpenRCA problems. 4.1 SAMPLING-BASED METHODS Given the vast volume of telemetry data, it is impractical to feed all telemetry into the LLMs due to their limited context window. A common strategy to reduce telemetry volume in RCA is sampling (Huang et al., 2024; He et al., 2023). Thus, we downsample all telemetry data (including trace, log, and metric) to a frequency of one minute by selecting the first recorded value within each minute, regardless of the original frequency. Meanwhile, we relax the accuracy criterion for predicting failure start time to within one minute of the actual event. However, this sampling is still insufficient, as the metrics consist of a large number of KPI types (e.g., memory usage, network delay), necessitating further sampling of KPI types. Thus, we consider two sampling strategies: Oracle Sampling: To investigate the upper bound of the sampling-based method's performance, we introduce oracle sampling. During benchmark construction, engineers identified a fixed set of golden KPIs that are helpful for identifying the root cause. In oracle sampling, we filter these golden KPIs as our target. Balanced Sampling: We use stratified sampling by iteratively selecting one random but unique KPI type from each metric file until the number of sampled KPIs matches that in the Oracle setting. 4.2 LANGUAGE MODELS To solve OpenRCA tasks, which require processing long contexts, we selected six models with at least 128K token context windows, including three proprietary models: Claude 3.5, GPT-4o, and Gemini 1.5 Pro; and three open-source models: Mistral Large 2, Command R+, and Llama 3.1 Instruct. |
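The two sampling steps quoted in the Experiment Setup row (per-minute downsampling by first value, and balanced stratified selection of KPI types) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record schema (`(timestamp_seconds, value)` pairs), the `metric_files` mapping, and the helper names `downsample_per_minute` and `balanced_sample` are all assumptions made for this sketch.

```python
import random


def downsample_per_minute(records):
    """Keep only the first recorded value within each minute.

    `records` is an iterable of (timestamp_seconds, value) pairs,
    assumed sorted by timestamp (hypothetical schema, not the
    paper's actual telemetry format).
    """
    seen_minutes = set()
    kept = []
    for ts, value in records:
        minute = ts // 60
        if minute not in seen_minutes:
            seen_minutes.add(minute)
            kept.append((ts, value))
    return kept


def balanced_sample(metric_files, target_count, seed=0):
    """Stratified sampling: repeatedly draw one random, not-yet-chosen
    KPI type from each metric file until `target_count` KPIs are drawn
    (mirroring the paper's "one random but unique KPI type per file"
    description; `metric_files` maps file name -> list of KPI types).
    """
    rng = random.Random(seed)
    chosen = []
    remaining = {name: list(kpis) for name, kpis in metric_files.items()}
    while len(chosen) < target_count and any(remaining.values()):
        for name, kpis in remaining.items():
            if kpis and len(chosen) < target_count:
                pick = rng.choice(kpis)
                kpis.remove(pick)
                chosen.append((name, pick))
    return chosen
```

In the oracle setting, `target_count` would be the size of the engineer-identified golden-KPI set, so both strategies feed the model the same number of KPIs and differ only in which ones are selected.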