In-Context Fine-Tuning for Time-Series Foundation Models

Authors: Matthew Faw, Rajat Sen, Yichen Zhou, Abhimanyu Das

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the benefits of in-context fine-tuning using our foundation model, and show that in-context fine-tuning can lead to better zero-shot performance on popular forecasting benchmarks as compared to supervised deep learning methods, statistical models, as well as other foundation models. In particular, on a well-known forecasting benchmark comprising 23 datasets not included in the pretraining of our foundation models, we show that our in-context fine-tuned model is 6.8% better than the base model we start from, while also being 5% better than the next best baseline.
Researcher Affiliation | Collaboration | (1) Georgia Institute of Technology (part of this work was done while the author was a Student Researcher and Visiting Researcher at Google Research); (2) Google Research. Correspondence to: Matthew Faw <EMAIL>.
Pseudocode | No | The paper describes the model architecture and data processing in detail, including mathematical formulations and figures (Figure 3, Figure 4) to illustrate concepts. However, it does not present any explicitly labeled pseudocode block or algorithm.
Open Source Code | No | The paper states "Pretraining performed in a manner similar to the latest version of the TimesFM Hugging Face repo." (Footnote 4), which refers to a repository for the base model. There is no explicit statement from the authors about releasing the source code for the specific methodology described in this paper.
Open Datasets | Yes | Similar to prior works, we report our results on the Chronos zero-shot benchmarks from Ansari et al. (2024), as well as rolling-window evaluation of the ETT datasets (Zhou et al., 2021). No data from these datasets (not even the training splits) was used in the training of our base model TimesFM (base) or our in-context fine-tuned model TimesFM-ICF. ... The Chronos zero-shot benchmark is a collection of 27 datasets of different training and prediction lengths... We give a detailed breakdown of the zero-shot evaluations on the datasets from Table 4 (displayed in Figure 5) in Table 6, with additional baselines as mentioned in Appendix A.2. We conduct the remaining evaluations ourselves, using the datasets available at this URL from the Chronos authors.
Dataset Splits | Yes | We conduct the same evaluation as in the long sequence forecasting evaluation (Woo et al., 2024) on these datasets, focusing on the task of predicting horizon lengths 96, 192, 336, and 720. We provide rolling validation numbers for the test time-period, which consists of the last 1/5th of the time-points. This is standard for these benchmarks (Nie et al., 2023), where the datasets are split into train:validation:test in the ratio 7:1:2.
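The 7:1:2 split with rolling-window evaluation over the final fifth of the series can be sketched as follows. This is a minimal illustration, not the authors' code; the context length, horizon, and stride values below are illustrative choices, not taken from the paper.

```python
def split_series(series):
    """Split a series into train/val/test in the standard 7:1:2 ratio."""
    n = len(series)
    return series[:int(0.7 * n)], series[int(0.7 * n):int(0.8 * n)], series[int(0.8 * n):]

def rolling_windows(series, test_start, context_len, horizon, stride):
    """Yield (context, target) pairs rolling over the test period."""
    for t in range(test_start, len(series) - horizon + 1, stride):
        yield series[max(0, t - context_len):t], series[t:t + horizon]

series = list(range(1000))
train, val, test = split_series(series)
print(len(train), len(val), len(test))  # 700 100 200

# Roll over the test period (starts at index 800 here) with horizon 96.
windows = list(rolling_windows(series, test_start=800, context_len=512, horizon=96, stride=96))
print(len(windows))  # 2
```

Each yielded pair is a model input (the trailing context) and the ground-truth target for one forecast window; metrics are averaged over all windows.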
Hardware Specification | Yes | The inference numbers are reported on TPUv5e with 8 tensor cores. All timing is reported on TPUv5e with 4 tensor cores.
Software Dependencies | No | The paper does not explicitly state the software dependencies, or their version numbers, required to replicate the experiments. Deep learning frameworks and libraries are implied by the described setup, but no explicit dependency list with versions is provided.
Experiment Setup | Yes | For all our fine-tuning runs, we use a batch size of 16 and (1) up to 10k iterations for the OOD Benchmark and (2) up to 100k for the Long Horizon ETT. We use a maximum learning rate of 1e-3, with 500-step linear warm-up and exponential decay. ... We start from the model architecture in Das et al. (2024), then create TimesFM (base) with 16 attention heads, 50 layers, an input patch length of 32, and an output patch length of 128. The model dimension is set to 1280. We use the learning rate schedule in (Vaswani et al., 2017) with a peak learning rate of 5e-4. The hidden dims of both the residual block and the FFN in the transformer layers are set to the same as the model dimension.
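The two learning-rate schedules mentioned above can be sketched as below. This is a hedged reconstruction, not the authors' code: the fine-tuning schedule is "peak 1e-3, 500-step linear warm-up, exponential decay" (the decay rate is an assumed placeholder, as the paper excerpt does not state it), and the pretraining schedule is the inverse-square-root schedule of Vaswani et al. (2017), rescaled here so its peak equals 5e-4.

```python
def finetune_lr(step, peak_lr=1e-3, warmup_steps=500, decay_rate=0.9995):
    """Linear warm-up to peak_lr, then exponential decay (decay_rate assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * decay_rate ** (step - warmup_steps)

def pretrain_lr(step, warmup_steps=500, peak_lr=5e-4):
    """Transformer schedule from Vaswani et al. (2017): lr ~ min(step^-0.5,
    step * warmup^-1.5), rescaled so the peak (at step == warmup_steps) is peak_lr."""
    step = max(step, 1)
    raw = min(step ** -0.5, step * warmup_steps ** -1.5)
    return peak_lr * raw / warmup_steps ** -0.5
```

Both schedules rise linearly over the first 500 steps and decay afterwards; they differ in decay shape (geometric vs. inverse square root).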