Forecasting Company Fundamentals
Authors: Felix Divo, Eric Endress, Kevin Endler, Kristian Kersting, Devendra Singh Dhami
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we try to bridge this gap and thoroughly evaluate the theoretical properties and practical performance of 24 deterministic and probabilistic company fundamentals forecasting models on real company data. We observe that deep learning models provide superior forecasting performance to classical models, in particular when considering uncertainty estimation. To validate the findings, we compare them to human analyst expectations and find that their accuracy is comparable to the automatic forecasts. We further show how these high-quality forecasts can benefit automated stock allocation. |
| Researcher Affiliation | Collaboration | Felix Divo EMAIL AI & ML Lab, Computer Science Department, TU Darmstadt, Germany Eric Endress EMAIL ACATIS Investment, Frankfurt am Main, Germany Kevin Endler EMAIL ACATIS Investment, Frankfurt am Main, Germany Kristian Kersting EMAIL AI & ML Lab, Computer Science Department, TU Darmstadt, Germany Hessian Center for AI (hessian.AI), Darmstadt, Germany German Research Center for Artificial Intelligence (DFKI), Darmstadt, Germany Centre for Cognitive Science, TU Darmstadt, Darmstadt, Germany Devendra Singh Dhami EMAIL Mathematics and Computer Science Department, TU Eindhoven, Eindhoven, Netherlands |
| Pseudocode | No | The paper describes various models (Mean Value, ARMA, Prophet, Linear Regression, Random Forest, DLinear, NLinear, RNN, TCN, Transformer, TFT, N-BEATS, N-HiTS, TiDE, xLSTM-Mixer, Chronos) and their properties, but it does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | We make our implementation openly available: https://github.com/felixdivo/Forecasting-Company-Fundamentals (Chronos: https://github.com/amazon-science/chronos-forecasting) |
| Open Datasets | No | The primary quarterly company fundamentals data was obtained by integrating several proprietary data sources from S&P Global. |
| Dataset Splits | Yes | We used 2009 Q1 as a fixed origin and continuously expanded the training data from only four years (16 quarters) at the beginning to 13.75 years (55 quarters) at the end. We evaluated several configurations to determine the optimal lookback window. Three years (twelve quarters) was optimal for forecasting one year (four quarters). We evaluated all models on all companies and metrics for each of the 40 simulated historical forecast chunks, each consisting of in-sample (IS) look-back data, IS training forecasts, and out-of-sample (OOS) testing forecasts. To search hyperparameters and validate the deep learning models' training success, we created a validation split by excluding 10% of the companies from training (but not from testing). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | We based our analysis on the implementation in the darts software library (Herzen et al., 2022) and the official implementation of Chronos (Ansari et al., 2024). We make our implementation openly available. We used scikit-learn (Pedregosa et al., 2011) for data normalization. For evaluation, we used torchmetrics (Detlefsen et al., 2022) and the CRPS implementation of Pyro (Bingham et al., 2019). |
| Experiment Setup | Yes | We trained all deep learning models with gradient descent on a batch size of 64 for 100 epochs. We adjusted for different training duration requirements by early stopping after no improvement in the validation loss for three epochs. We employed the AdamW optimizer (Loshchilov & Hutter, 2018) with learning rate 10⁻⁴, weight decay of 10⁻², β1 = 0.9, and β2 = 0.999. Furthermore, we clipped gradients to 1.0 to stabilize training. For all models except DLinear and NLinear, we used a 10% dropout rate. The following lists the specific hyperparameters for each of the deep learning architectures. For DLinear, we used a kernel size of 10 to estimate the moving average of the trend. All recurrent neural networks (GRU, LSTM, and block variants) were three layers deep. The single models and block variants had a hidden size of 64 and 128, respectively. The TCN convolutions were set to a kernel width of 3 steps, 16 filters, and a dilation of 2 time steps. The Transformer was trained with a token size of 120, a feedforward dimension of 512, GELU activations, four encoder layers, four decoder layers, and six attention heads. TFT used a hidden size of 36 and six attention heads covering the entire time span. We learned a single block of six layers with hidden dimension 512 for N-BEATS. The N-HiTS models consisted of three stacks, each with a single two-layer block and 512 hidden dimensions. Our configuration of TiDE had two encoder and decoder layers each, a hidden size of 128, and a decoder with a hidden size of 32 and an output dimension of 16. |
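The expanding-window evaluation quoted under Dataset Splits can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the quarter indexing (0 at the fixed 2009 Q1 origin) and the helper name are assumptions. Note that growing the training window from 16 to 55 quarters one quarter at a time yields exactly the 40 forecast chunks the paper reports.

```python
# Sketch of the expanding-window protocol: training data grows from
# 16 quarters to 55 quarters (quarter 0 = the fixed 2009 Q1 origin),
# with a 12-quarter lookback window and a 4-quarter forecast horizon.
# Hypothetical helper for illustration only.

def expanding_window_splits(min_train=16, max_train=55, lookback=12, horizon=4):
    """Yield (train, lookback_window, forecast) quarter-index ranges."""
    splits = []
    for train_len in range(min_train, max_train + 1):
        train = list(range(0, train_len))                    # all quarters up to the cutoff
        look = list(range(train_len - lookback, train_len))  # 3-year input window
        fore = list(range(train_len, train_len + horizon))   # 1-year OOS forecast
        splits.append((train, look, fore))
    return splits

splits = expanding_window_splits()
print(len(splits))  # 40 simulated historical forecast chunks
```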
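The early-stopping rule quoted in the setup (stop once the validation loss has not improved for three consecutive epochs) can be expressed as a small helper. This is a minimal sketch of that rule, not the authors' implementation; the class and method names are invented for illustration.

```python
# Minimal early-stopping helper: training halts after `patience`
# consecutive epochs without a new best validation loss.
# Illustrative sketch only; names are hypothetical.

class EarlyStopping:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [0.9, 0.8, 0.81, 0.82, 0.83]  # improvement stalls after the second epoch
stop_flags = [stopper.step(loss) for loss in losses]
print(stop_flags)  # stops at the fifth epoch
```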