Exploring Representations and Interventions in Time Series Foundation Models
Authors: Michał Wiliński, Mononito Goswami, Willa Potosnak, Nina Żukowska, Artur Dubrawski
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of our proposed pruning strategy, we explore two pruning configurations: one in which we prune all redundant blocks, and another in which we prune only a single block. We compare the performance of these pruned models to the original, unpruned TSFMs using standard task-specific accuracy metrics (Mean Squared Error and Mean Absolute Error) and efficiency metrics (inference time in milliseconds and theoretical model size in megabytes). We evaluate these models on widely used imputation (Zhou et al., 2021) and forecasting (Ansari et al., 2024) benchmarks in both zero-shot settings and after linear probing (Goswami et al., 2024). |
| Researcher Affiliation | Academia | 1Auton Lab, School of Computer Science, Carnegie Mellon University. Correspondence to: Michał Wiliński <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Block-wise Pruning (Skipping Computation). Require: trained model M with layers {l1, l2, ..., ln}; identified redundant blocks B = {b1, b2, ..., bk}. For each block bi in B do: let bi consist of layers ls to le {block edges at ls and le}; for layer index j = s + 1 to e − 1 do: remove layer lj from model M; end for; end for. Return pruned model M. |
| Open Source Code | Yes | To ensure reproducibility, we have made our code anonymously accessible through GitHub. All time series foundation models used for our analysis are publicly available and open-source: Chronos, MOMENT, and MOIRAI. |
| Open Datasets | Yes | We are grateful to the creators and maintainers of the UCR Time-Series Archives (classification & anomaly) (Dau et al., 2018; Keogh et al., 2021), the Monash Forecasting Archive team (Godahewa et al., 2021), the Informer long-horizon data providers (ETT, Electricity, Exchange Rate, Weather, Traffic, ILI) (Zhou et al., 2021), the custodians of TSB-UAD (Paparrizos et al., 2022), and every other public dataset we relied on. |
| Dataset Splits | Yes | We evaluate these models on widely used imputation (Zhou et al., 2021) and forecasting (Ansari et al., 2024) benchmarks in both zero-shot settings and after linear probing (Goswami et al., 2024). While prior work such as Nguyen et al. (2021) primarily focused on pruning individual layers, our approach differs by targeting entire blocks of self-similar layers, preserving boundary layers to maintain representational continuity. Furthermore, we extend block-level pruning to large pretrained TSFMs, demonstrating practical effectiveness on diverse real-world tasks beyond classification, achieving substantial inference speedups (up to 52%) with minimal performance degradation, even in zero-shot scenarios. Table 3: Zero-shot imputation performance of MOMENT. Results are averaged across four different masking rates, {12.5%, 25%, 37.5%, 50%}, and five runs with different masking seeds. Table 6: Fine-tuned forecasting performance of MOMENT. This table presents model performance metrics (Mean Absolute Error and Mean Squared Error) on a subset of the Time Series Pile (Goswami et al., 2024; Zhou et al., 2021). Metrics are presented for MOMENT-Large without any pruning (Vanilla) and with all blocks pruned (All). Results are gathered from the best performance on the test set across 3 epochs of training with a batch size of 64 and a learning rate of 0.0001. |
| Hardware Specification | Yes | All models were trained and evaluated on a computing cluster consisting of 128 AMD EPYC 7502 CPUs, 503 GB of RAM, and 8 NVIDIA RTX A6000 GPUs, each with 49 GiB of RAM. |
| Software Dependencies | No | The paper acknowledges "the open-source community authors of the libraries, frameworks, and tooling that make modern machine learning research possible," but does not specify any particular software dependencies with version numbers. |
| Experiment Setup | Yes | For Chronos, steering required tuning the parameter λ ≈ 0.1 for effective performance, whereas MOMENT maintained effective steering with λ = 1. Results are gathered from the best performance on the test set across 3 epochs of training with a batch size of 64 and a learning rate of 0.0001. |
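The block-wise pruning pseudocode quoted in the table (Algorithm 1) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `prune_blocks`, the list-of-layers representation, and the 0-based indexing are assumptions made here for clarity (the quoted pseudocode is 1-based).

```python
# Hedged sketch of Algorithm 1 (block-wise pruning by skipping computation).
# The function name and data layout are illustrative, not the paper's code.

def prune_blocks(layers, blocks):
    """Drop the interior layers of each redundant block.

    layers : list of layer objects (0-based indices here for illustration).
    blocks : list of (s, e) index pairs marking block edges; the edge
             layers l_s and l_e are kept, interior layers l_{s+1}..l_{e-1}
             are removed, mirroring the loop j = s + 1 .. e - 1.
    """
    to_drop = set()
    for s, e in blocks:
        to_drop.update(range(s + 1, e))  # interior indices of this block
    return [layer for i, layer in enumerate(layers) if i not in to_drop]

# Toy usage: 8 "layers" named by index, one redundant block spanning 2..5.
layers = list(range(8))
pruned = prune_blocks(layers, [(2, 5)])
# Interior layers 3 and 4 are skipped; edge layers 2 and 5 survive.
```

Keeping the boundary layers of each block matches the paper's stated goal of preserving representational continuity at the block edges.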
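The steering strength λ mentioned in the experiment setup (λ ≈ 0.1 for Chronos, λ = 1 for MOMENT) typically scales a steering direction added to hidden activations. The sketch below shows only that scaling; the steering-vector computation and where in the network it is applied are assumptions made here, not details quoted from the paper.

```python
# Hedged sketch of activation steering with strength lam (λ): shift hidden
# activations along a steering direction. Illustrative only.

def steer(hidden, steering_vector, lam):
    """Return hidden + lam * steering_vector, elementwise."""
    return [h + lam * v for h, v in zip(hidden, steering_vector)]

hidden = [0.5, -0.2, 1.0]
direction = [1.0, 1.0, -1.0]

weak = steer(hidden, direction, 0.1)   # a Chronos-like small lam
full = steer(hidden, direction, 1.0)   # a MOMENT-like lam of 1
```

The quoted finding, that Chronos needs a much smaller λ than MOMENT, corresponds here to how far the activations are pushed along the steering direction before the intervention degrades performance.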