FreqLLM: Frequency-Aware Large Language Models for Time Series Forecasting
Authors: Shunnan Wang, Min Gao, Zongwei Wang, Yibing Bai, Feng Jiang, Guansong Pang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark datasets demonstrate that FreqLLM outperforms state-of-the-art TSF methods in both accuracy and generalization. |
| Researcher Affiliation | Academia | 1Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education 2School of Big Data and Software Engineering, Chongqing University 3School of Computing and Information Systems, Singapore Management University |
| Pseudocode | No | The paper describes the methodology in text and through diagrams (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/biya0105/FreqLLM. |
| Open Datasets | Yes | For the long-term forecasting experiments, we test using a variety of datasets, including the Electricity Transformer Temperature (ETT) dataset [Zhou et al., 2021], as well as weather and traffic datasets [Wu et al., 2023], which are widely used for evaluating the long-term forecasting performance of time series models. For short-term experiments, we primarily utilize the M4 benchmark dataset [Makridakis et al., 2018], which consists of time series data from annual, quarterly, monthly, and other categories, featuring large scale, wide coverage, and high-quality data. |
| Dataset Splits | Yes | We used a unified pipeline following the experimental configurations of all baselines [Wu et al., 2023]. In these experiments, we use the top 5% and 10% of the training data. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using GPT-2 as the backbone model and the Adam optimizer, but does not specify version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | Our method is trained with MSE loss, using the Adam [Kingma et al., 2015] optimizer with an initial learning rate of 10⁻². We maintain the backbone model at 32 layers. We set the patch dimension d_m to 16, the number of heads M to 8, the semantic exemplars size V to 1000, the loss weight λ to 0.08, the sliding window size to 8, and the prompt length K to 8. |
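
The hyperparameters reported in the Experiment Setup row can be collected into a single configuration sketch. This is an illustrative reconstruction only: the key names below are assumptions and are not taken from the paper's released code.

```python
# Hypothetical configuration sketch of FreqLLM's reported training setup.
# All key names are illustrative; values come from the Experiment Setup row.
FREQLLM_CONFIG = {
    "loss": "MSE",                # trained with MSE loss
    "optimizer": "Adam",          # Adam [Kingma et al., 2015]
    "learning_rate": 1e-2,        # initial learning rate 10^-2
    "backbone_layers": 32,        # backbone model kept at 32 layers
    "patch_dim": 16,              # patch dimension d_m
    "num_heads": 8,               # number of heads M
    "semantic_exemplars": 1000,   # semantic exemplars size V
    "loss_weight": 0.08,          # loss-weight hyperparameter
    "sliding_window": 8,          # sliding window size
    "prompt_length": 8,           # prompt length K
}

print(FREQLLM_CONFIG["learning_rate"])
```

Such a dictionary could be passed to a training script or serialized to YAML/JSON when attempting a reproduction run.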