Fourier Head: Helping Large Language Models Learn Complex Probability Distributions
Authors: Nate Gillman, Daksh Aggarwal, Michael Freeman, Saurabh Singh, Chen Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive analysis on synthetic datasets, as well as on large-scale decision making and time series forecasting tasks. ... For example, the Fourier head improves a Decision Transformer agent's returns across four benchmark Atari games by as much as 377%, and increases a state-of-the-art time series foundation model's forecasting performance by 3.5% across 20 benchmarks unseen during training. |
| Researcher Affiliation | Collaboration | Nate Gillman*,1, Daksh Aggarwal*,1, Michael Freeman1, Saurabh Singh2, Chen Sun1,2 1Brown University, 2Google DeepMind |
| Pseudocode | Yes | Algorithm 1 (Fourier head). Hyperparameters: the input dimension n, output dimension m, number of frequencies N |
| Open Source Code | Yes | We release our implementation at https://nategillman.com/fourier-head. Additionally, we have released the research code on GitHub: https://github.com/nate-gillman/fourier-head. |
| Open Datasets | Yes | Dataset: We use the same dataset from the original Decision Transformer implementation (Chen et al., 2021). ... Dataset: We use the same training dataset for large-scale pretraining that Ansari et al. (2024) used. We gather an evaluation benchmark of 20 time series datasets which were not seen during training. These 20 come from the zero-shot eval from (Ansari et al., 2024). ... See Table 9 for the datasets we used to train and evaluate Chronos. |
| Dataset Splits | Yes | Dataset: We create 3 synthetic datasets... our dataset has size 5000, with an 80-20 split between the train and test set. ... Dataset: We use the same dataset from the original Decision Transformer implementation (Chen et al., 2021). ... Dataset: We use the same training dataset for large-scale pretraining that Ansari et al. (2024) used. We gather an evaluation benchmark of 20 time series datasets which were not seen during training. These 20 come from the zero-shot eval from (Ansari et al., 2024). |
| Hardware Specification | No | Our research was conducted using computational resources at the Center for Computation and Visualization at Brown University. Following the original Decision Transformer implementation, we trained on 500k transitions observed by a DQN agent during training, for 5 epochs. We trained on the same model size as the original implementation (a GPT-1 model with approximately 2.012M parameters) which takes about 4 hours on a single GPU. We followed the original Chronos implementation, keeping all hyperparameters the same. In particular, we trained for 200k steps, on the same model size as the original implementation (the T5 model with approximately 20M parameters) and this takes about 48 hours on 8 GPUs. |
| Software Dependencies | No | In PyTorch, linear layers use He initialization (He et al., 2015) by default, which ensures that the linear layer outputs values close to zero in expectation. |
| Experiment Setup | Yes | We sweep over frequencies N = 2, 4, ..., 20, and consider regularization γ ∈ {0, 10⁻⁶}. We train those models via cross-entropy loss. We conduct LoRA fine-tuning for 16 epochs with a learning rate of 3 × 10⁻⁴ and a linear decay schedule, and a batch size of 64. In our experiments we consider frequencies N ∈ {2, 4, 6, 8, ..., 30, 32}. For our study, we will replace this with a Fourier head with frequencies N = 64, 128, 256, 550. We use mixed precision binning; this is informed by an analysis of the Fourier spectrum of the next-token distribution, as described in Section 2.3. We also use Fourier weight decay regularization with γ = 10⁻⁶. |
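
The Pseudocode row above quotes the header of Algorithm 1 (Fourier head), which maps an n-dimensional input to a categorical distribution over m bins via a learned Fourier density on [-1, 1]. The following is a minimal NumPy sketch of one forward pass for a single example, not the authors' released implementation (see their GitHub link above): the variable names, the autocorrelation-based nonnegativity trick, and the exact normalization are illustrative assumptions.

```python
import numpy as np

def fourier_head_probs(x, W, b, num_bins):
    """Illustrative Fourier-head forward pass for one input vector x.

    W, b parameterize a linear map n -> 2*(N+1) whose outputs are the
    real/imaginary parts of N+1 complex coefficients (assumed layout).
    """
    raw = W @ x + b                          # shape (2*(N+1),)
    alpha = raw[0::2] + 1j * raw[1::2]       # N+1 complex coefficients
    N = alpha.size - 1
    # Autocorrelating alpha yields a positive semi-definite sequence c,
    # so the truncated Fourier series below is nonnegative on [-1, 1].
    c = np.array([(alpha[: alpha.size - k] * np.conj(alpha[k:])).sum()
                  for k in range(N + 1)])
    c = c / (2.0 * c[0].real)                # density now integrates to 1 on [-1, 1]
    # Evaluate the density at the centers of m equal-width bins of [-1, 1].
    centers = np.linspace(-1.0, 1.0, num_bins + 1)[:-1] + 1.0 / num_bins
    k = np.arange(1, N + 1)
    p = 0.5 + 2.0 * np.real(
        c[1:][None, :] * np.exp(1j * np.pi * centers[:, None] * k[None, :])
    ).sum(axis=1)
    p = np.clip(p, 1e-12, None)              # guard against float round-off
    return p / p.sum()                       # categorical probabilities over m bins

# Hypothetical usage with random parameters (n=8 inputs, N=4 frequencies, m=16 bins).
rng = np.random.default_rng(0)
n, N, m = 8, 4, 16
W = rng.normal(scale=0.1, size=(2 * (N + 1), n))
probs = fourier_head_probs(rng.normal(size=n), W, np.zeros(2 * (N + 1)), m)
```

In the real model the log of these bin probabilities would serve as logits for the cross-entropy loss mentioned in the Experiment Setup row, with num_bins playing the role of the output dimension m.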