Fourier Head: Helping Large Language Models Learn Complex Probability Distributions
Authors: Nate Gillman, Daksh Aggarwal, Michael Freeman, Saurabh Singh, Chen Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive analysis on synthetic datasets, as well as on large-scale decision making and time series forecasting tasks. ... For example, the Fourier head improves a Decision Transformer agent's returns across four benchmark Atari games by as much as 377%, and increases a state-of-the-art time series foundation model's forecasting performance by 3.5% across 20 benchmarks unseen during training. |
| Researcher Affiliation | Collaboration | Nate Gillman*,1, Daksh Aggarwal*,1, Michael Freeman1, Saurabh Singh2, Chen Sun1,2 1Brown University, 2Google DeepMind |
| Pseudocode | Yes | Algorithm 1 (Fourier head). Hyperparameters: the input dimension n, output dimension m, number of frequencies N |
| Open Source Code | Yes | We release our implementation at https://nategillman.com/fourier-head. Additionally, we have released the research code on GitHub: https://github.com/nate-gillman/fourier-head. |
| Open Datasets | Yes | Dataset: We use the same dataset from the original Decision Transformer implementation (Chen et al., 2021). ... Dataset: We use the same training dataset for large-scale pretraining that Ansari et al. (2024) used. We gather an evaluation benchmark of 20 time series datasets which were not seen during training. These 20 come from the zero-shot eval from (Ansari et al., 2024). ... See Table 9 for the datasets we used to train and evaluate Chronos. |
| Dataset Splits | Yes | Dataset: We create 3 synthetic datasets... our dataset has size 5000, with an 80-20 split between the train and test set. ... Dataset: We use the same dataset from the original Decision Transformer implementation (Chen et al., 2021). ... Dataset: We use the same training dataset for large-scale pretraining that Ansari et al. (2024) used. We gather an evaluation benchmark of 20 time series datasets which were not seen during training. These 20 come from the zero-shot eval from (Ansari et al., 2024). |
| Hardware Specification | No | Our research was conducted using computational resources at the Center for Computation and Visualization at Brown University. Following the original Decision Transformer implementation, we trained on 500k transitions observed by a DQN agent during training, for 5 epochs. We trained on the same model size as the original implementation (a GPT-1 model with approximately 2.012M parameters) which takes about 4 hours on a single GPU. We followed the original Chronos implementation, keeping all hyperparameters the same. In particular, we trained for 200k steps, on the same model size as the original implementation (the T5 model with approximately 20M parameters) and this takes about 48 hours on 8 GPUs. |
| Software Dependencies | No | In PyTorch, linear layers use He initialization (He et al., 2015) by default, which ensures that the linear layer outputs values close to zero in expectation. |
| Experiment Setup | Yes | We sweep over frequencies N = 2, 4, ..., 20, and consider regularization γ ∈ {0, 10⁻⁶}. We train those models via cross-entropy loss. We conduct LoRA fine-tuning for 16 epochs with a learning rate of 3 × 10⁻⁴ and a linear decay schedule, and a batch size of 64. In our experiments we consider frequencies N ∈ {2, 4, 6, 8, ..., 30, 32}. For our study, we will replace this with a Fourier head with frequencies N = 64, 128, 256, 550. We use mixed precision binning; this is informed by an analysis of the Fourier spectrum of the next-token distribution, as described in Section 2.3. We also use Fourier weight decay regularization with γ = 10⁻⁶. |
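
The Pseudocode row above quotes the header of Algorithm 1 (Fourier head), which maps an n-dimensional input to a categorical distribution over m bins via a learned Fourier density on [-1, 1]. The following is a minimal NumPy sketch of one forward pass for a single example, not the authors' released implementation (see their GitHub link above): the variable names, the autocorrelation-based nonnegativity trick, and the exact normalization are illustrative assumptions.

```python
import numpy as np

def fourier_head_probs(x, W, b, num_bins):
    """Illustrative Fourier-head forward pass for one input vector x.

    W, b parameterize a linear map n -> 2*(N+1) whose outputs are the
    real/imaginary parts of N+1 complex coefficients (assumed layout).
    """
    raw = W @ x + b                          # shape (2*(N+1),)
    alpha = raw[0::2] + 1j * raw[1::2]       # N+1 complex coefficients
    N = alpha.size - 1
    # Autocorrelating alpha yields a positive semi-definite sequence c,
    # so the truncated Fourier series below is nonnegative on [-1, 1].
    c = np.array([(alpha[: alpha.size - k] * np.conj(alpha[k:])).sum()
                  for k in range(N + 1)])
    c = c / (2.0 * c[0].real)                # density now integrates to 1 on [-1, 1]
    # Evaluate the density at the centers of m equal-width bins of [-1, 1].
    centers = np.linspace(-1.0, 1.0, num_bins + 1)[:-1] + 1.0 / num_bins
    k = np.arange(1, N + 1)
    p = 0.5 + 2.0 * np.real(
        c[1:][None, :] * np.exp(1j * np.pi * centers[:, None] * k[None, :])
    ).sum(axis=1)
    p = np.clip(p, 1e-12, None)              # guard against float round-off
    return p / p.sum()                       # categorical probabilities over m bins

# Hypothetical usage with random parameters (n=8 inputs, N=4 frequencies, m=16 bins).
rng = np.random.default_rng(0)
n, N, m = 8, 4, 16
W = rng.normal(scale=0.1, size=(2 * (N + 1), n))
probs = fourier_head_probs(rng.normal(size=n), W, np.zeros(2 * (N + 1)), m)
```

In the real model the log of these bin probabilities would serve as logits for the cross-entropy loss mentioned in the Experiment Setup row, with num_bins playing the role of the output dimension m.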