Can LLMs Understand Time Series Anomalies?

Authors: Zihao Zhou, Rose Yu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our study investigates whether LLMs can understand and detect anomalies in time series data, focusing on zero-shot and few-shot scenarios. Inspired by conjectures about LLM behavior from time series forecasting research, we formulate key hypotheses about LLMs' capabilities in time series anomaly detection. We design and conduct principled experiments to test each of these hypotheses. Our investigation reveals several surprising findings about LLMs for time series: (1) LLMs understand time series better as images than as text; (2) LLMs do not demonstrate enhanced performance when prompted to engage in explicit reasoning about time series analysis; (3) contrary to common beliefs, LLMs' understanding of time series does not stem from their repetition biases or arithmetic abilities; (4) LLMs' behaviors and performance in time series analysis vary significantly across different models. This study provides the first comprehensive analysis of contemporary LLM capabilities in time series anomaly detection. Our results suggest that while LLMs can understand trivial time series anomalies, we have no evidence that they can understand more subtle real-world anomalies. Many common conjectures based on their reasoning capabilities do not hold. All synthetic dataset generators, final prompts, and evaluation scripts have been made available at https://github.com/rose-stl-lab/anomllm.
Researcher Affiliation Academia Zihao Zhou Dept of Computer Science and Engineering University of California, San Diego La Jolla, CA 92093, USA EMAIL Rose Yu Dept of Computer Science and Engineering University of California, San Diego La Jolla, CA 92093, USA EMAIL
Pseudocode Yes Algorithm 1 Anomaly Generation Process
1: for each dataset type do
2:     if multivariate data needed then
3:         Randomly select sensors to contain anomalies based on the ratio of anomalous sensors
4:     end if
5:     for each selected sensor do
6:         Generate normal intervals using an exponential distribution with the normal rate
7:         Generate anomaly intervals using an exponential distribution with the anomaly rate
8:         Ensure minimum durations for both normal and anomaly intervals
9:         Apply the appropriate anomaly type to the anomaly intervals:
10:            if anomaly type is point or range then
11:                Simulate the full time series
12:                Directly replace the normal data with the anomaly data / inject noise
13:            else if anomaly type is trend or frequency then
14:                Simulate region by region to ensure continuity
15:            end if
16:     end for
17:     Record the start and end points of each anomaly interval as ground truth
18: end for
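The interval-generation core of Algorithm 1 (steps 6-8 and 17) can be sketched as follows. This is a minimal illustration, not the authors' released code: the parameter names (`normal_rate`, `min_normal`, `seed`) and the rate-to-scale conversion for the exponential draws are assumptions.

```python
import numpy as np

def generate_intervals(length, normal_rate, anomaly_rate,
                       min_normal=10, min_anomaly=5, seed=0):
    """Alternate normal and anomaly intervals whose durations are drawn
    from exponential distributions (mean = 1 / rate), enforcing minimum
    durations; return the (start, end) anomaly intervals as ground truth."""
    rng = np.random.default_rng(seed)
    intervals = []
    t = 0
    while t < length:
        # Normal interval: exponential draw with the normal rate
        t += max(min_normal, int(rng.exponential(1.0 / normal_rate)))
        if t >= length:
            break
        # Anomaly interval: exponential draw with the anomaly rate
        end = min(t + max(min_anomaly, int(rng.exponential(1.0 / anomaly_rate))),
                  length)
        intervals.append((t, end))  # recorded as ground-truth labels
        t = end
    return intervals
```

The anomaly type (point, range, trend, or frequency) would then be applied inside each returned interval, as in steps 9-15 of the pseudocode.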
Open Source Code Yes All synthetic dataset generators, final prompts, and evaluation scripts have been made available in https://github.com/rose-stl-lab/anomllm.
Open Datasets Yes All synthetic dataset generators, final prompts, and evaluation scripts have been made available at https://github.com/rose-stl-lab/anomllm. Still, our experiments on the real-world Yahoo S5 dataset (Laptev & Amizadeh, 2015) (see Appendix D) show consistent findings.
Dataset Splits No The paper describes generating synthetic datasets and using them in zero-shot and few-shot scenarios for anomaly detection with pre-trained LLMs. It specifies 'Number of time series per dataset: 400' and 'Number of samples per time series: 1000' but does not provide explicit training, validation, or test splits for these datasets. For few-shot, it states 'n is typically small (e.g., 1-5)' but does not detail how these examples are selected or partitioned from the generated data to ensure reproducibility of the splitting process.
Hardware Specification No The paper mentions using vLLM for Qwen inference and LMDeploy for InternVL2 inference but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) on which these inferences were run.
Software Dependencies Yes We use vLLM (Kwon et al., 2023) for Qwen inference and LMDeploy (Contributors, 2023) for InternVL2 inference. We perform experiments using four state-of-the-art M-LLMs, two of which are open-sourced: Qwen-VL-Chat (Bai et al., 2023) and InternVL2-Llama3-76B (Chen et al., 2024), and two of which are proprietary: GPT-4o-mini (OpenAI, 2024) and Gemini-1.5-Flash (Google, 2024). The GPT-4o-mini variant we used in this work is gpt-4o-2024-08-06. Gemini-1.5-Flash is proprietary. We use the model variant gemini-1.5-flash-002. Qwen-VL-Chat is open-sourced. We use the model last updated on Jan 25, 2024. InternVL2 is open-sourced, and we use the variant InternVL2-Llama3-76B last updated on July 15, 2024. We utilize Matplotlib to generate visual representations of the time series data. The interpolation is performed using the interp1d function from the SciPy library with the linear method.
Experiment Setup Yes We incorporate two main prompting techniques in our investigation: Zero-Shot and Few-Shot Learning (FSL) and Chain-of-Thought (CoT). For FSL, we examine the LLM's ability to detect anomalies without any examples (zero-shot) and with a small number of labeled examples (few-shot). For CoT, we implement example in-context CoT templates... We explore two primary input modalities for time series data: textual and visual representations. Textual Representations. We examine several text encoding strategies to enhance the LLM's comprehension of time series data: (1) Original: raw time series values presented as rounded, space-separated numbers. (2) CSV: time series data formatted as CSV (index and value per line, comma-separated)... (3) Prompt as Prefix (PAP): including key statistics of the time series (mean, median, trend)... (4) Token per Digit (TPD): splitting floating-point numbers into space-separated digits... Visual Representations. We utilize Matplotlib to generate visual representations of the time series data. We observe a consistent improvement in performance when interpolating the time series from 1000 steps to 300 steps, as shown in Figure 7. Notably, the top-3 best-performing text variants in all experiments typically apply such shortening. This underscores the LLM's difficulty in handling long time series, especially since the tokenizer represents each digit as a separate token. The S0.3 variant subsamples the number of data points in the time series to 30% of the original size.
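The text encodings and the interp1d-based shortening described above can be sketched as follows. This is an illustrative reconstruction, not the released prompts: the exact rounding precision, separators (especially the value separator in the TPD encoding), and function names are assumptions; only the use of SciPy's `interp1d` with the linear method is stated in the paper.

```python
import numpy as np
from scipy.interpolate import interp1d

def encode_original(series, ndigits=2):
    # "Original": rounded values, space-separated
    return " ".join(f"{v:.{ndigits}f}" for v in series)

def encode_csv(series, ndigits=2):
    # "CSV": one "index,value" pair per line
    return "\n".join(f"{i},{v:.{ndigits}f}" for i, v in enumerate(series))

def encode_tpd(series, ndigits=2):
    # "Token per Digit": each character of a number is space-separated
    # so the tokenizer sees one digit per token (" , " between values
    # is an assumed delimiter)
    return " , ".join(" ".join(f"{v:.{ndigits}f}") for v in series)

def shorten(series, target_len=300):
    # Linearly interpolate from the original grid (e.g., 1000 steps)
    # onto a coarser grid (e.g., 300 steps) to reduce token count
    x_old = np.linspace(0.0, 1.0, len(series))
    x_new = np.linspace(0.0, 1.0, target_len)
    return interp1d(x_old, series, kind="linear")(x_new)
```

For example, `encode_original(shorten(ts))` produces the shortened space-separated variant that the paper reports among the best-performing text encodings.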