reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Retrieval Augmented Time Series Forecasting

Authors: Sungwon Han, Seungeon Lee, Meeyoung Cha, Sercan O Arik, Jinsung Yoon

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical evaluations on ten benchmark datasets show that RAFT consistently outperforms contemporary baselines with an average win ratio of 86%.
Researcher Affiliation	Collaboration	(1) Department of AI Convergence, GIST, Gwangju, South Korea. This work was done while the author was at KAIST. (2) School of Computing, KAIST, Daejeon, South Korea. (3) Data Science for Humanity Group, Max Planck Institute for Security and Privacy, Bochum, Germany. (4) Google Cloud AI, Sunnyvale, United States. Correspondence to: Jinsung Yoon <EMAIL>.
Pseudocode	No	The paper describes algorithms using mathematical equations and textual descriptions (e.g., Eq. 1-9), but does not present them in a clearly labeled 'Pseudocode' or 'Algorithm' block format.
Open Source Code	Yes	Code is in https://github.com/archon159/RAFT
Open Datasets	Yes	We consider ten different benchmark datasets, each with a diverse range of variates, dataset lengths, and frequencies: (1-4) The ETT dataset... (5) The Electricity dataset records household electric power consumption over approximately 4 years (Trindade, 2015); (6) The Exchange dataset includes the daily exchange rates of eight countries over 27 years (1990-2016) (Lai et al., 2018); (7) The Illness dataset includes the weekly ratio of patients with influenza-like illness over 20 years (2002-2021); (8) The Solar dataset contains 10-minute solar power forecasts collected from power plants in 2006 (Liu et al., 2022a); (9) The Traffic dataset contains hourly road occupancy rates on freeways over 48 months; and (10) The Weather dataset consists of 21 weather-related indicators in Germany over one year.
Dataset Splits	Yes	The dataset size is presented in (Train, Validation, Test). The detailed information of each dataset is shown in Table 5. Table 5: Basic information of datasets used for evaluation. Columns: Dataset | # of variates | Dataset Size | Frequency. First row: ETTh1 | 7 | (8449, 2785, 2785) | Hourly.
Hardware Specification	Yes	For all experiments, the average results from three runs are reported, with each experiment conducted on a single NVIDIA A100 40GB GPU.
Software Dependencies	No	For implementation, we referred to the publicly available time-series repository (TSLib). The paper does not provide specific version numbers for software dependencies like Python, PyTorch, or the TSLib library itself.
Experiment Setup	Yes	RAFT employs the retrieval module with the following detailed settings. The periods are set to {1, 2, 4} (n = 3), following existing literature (Wang et al., 2024), and the temperature τ is set to 0.1. Batch size is set to 32. The initial learning rate, the number of patches used in the retrieval (m), and the size of the look-back window (L) are determined via grid search based on performance on the validation set, following prior work (Wang et al., 2024). For fair comparison, hyper-parameter tuning was performed for both our model and all baselines using the validation set. The learning rate is chosen from 1e-5 to 0.05, the look-back window size from {96, 192, 336, 720}, and the number of patches used in retrieval m from {1, 5, 10, 20}.
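
The search space quoted in the Experiment Setup row can be written down as a small configuration sketch. This is an illustration only, not code from the RAFT repository: the names `build_grid` and `select_best` are hypothetical, and because the excerpt gives the learning rate only as a range (1e-5 to 0.05), the specific learning-rate points below are assumed for the example.

```python
from itertools import product

# Fixed settings stated in the Experiment Setup row.
PERIODS = (1, 2, 4)          # n = 3 periods
TEMPERATURE = 0.1            # temperature tau
BATCH_SIZE = 32

# Searched settings stated in the Experiment Setup row.
LOOKBACK_WINDOWS = (96, 192, 336, 720)   # look-back window size L
NUM_PATCHES = (1, 5, 10, 20)             # patches used in retrieval, m

# ASSUMPTION: the excerpt gives only the range 1e-5 to 0.05.
# These grid points are illustrative, not taken from the paper.
LEARNING_RATES = (1e-5, 1e-4, 1e-3, 1e-2, 5e-2)


def build_grid():
    """Enumerate every candidate configuration in the search space."""
    return [
        {
            "learning_rate": lr,
            "lookback": lookback,
            "num_patches": m,
            "periods": PERIODS,
            "temperature": TEMPERATURE,
            "batch_size": BATCH_SIZE,
        }
        for lr, lookback, m in product(LEARNING_RATES, LOOKBACK_WINDOWS, NUM_PATCHES)
    ]


def select_best(grid, validation_loss):
    """Return the configuration with the lowest validation loss.

    `validation_loss` is a caller-supplied function (hypothetical here)
    that trains a model with the given configuration and returns its
    loss on the validation split.
    """
    return min(grid, key=validation_loss)
```

With the assumed learning-rate points, `build_grid()` yields 5 x 4 x 4 = 80 candidates; selection uses only the validation split, matching the tuning protocol described in the row.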