Alternators For Sequence Modeling

Authors: Mohammad Reza Rezaei, Adji Bousso Dieng

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We showcase the capabilities of alternators in three applications. We first used alternators to model the Lorenz equations, often used to describe chaotic behavior. We then applied alternators to neuroscience, to map brain activity to physical activity. Finally, we applied alternators to climate science, focusing on sea-surface temperature forecasting. In all our experiments, we found alternators are stable to train, fast to sample from, yield high-quality generated samples and latent variables, and often outperform strong baselines such as Mambas, neural ODEs, and diffusion models in the domains we studied.
Researcher Affiliation | Collaboration | Mohammad R. Rezaei (University of Toronto; Vertaix) and Adji Bousso Dieng (Department of Computer Science, Princeton University; Vertaix)
Pseudocode | Yes | Algorithm 1 (Dynamical Generative Modeling with Alternators):

Inputs: samples from p(x_{1:T}), batch size B, variances σ_x^2 and σ_z^2, schedule α_{1:T}
Initialize model parameters θ and ϕ
while not converged do
    for b = 1, ..., B do
        Draw initial latent z_0^(b) ~ N(0, I_{D_z})
        for t = 1, ..., T do
            Draw noise variables ε_xt^(b) ~ N(0, I_{D_x}) and ε_zt^(b) ~ N(0, I_{D_z})
            Draw x_t^(b) = sqrt(1 − σ_x^2) · f_θ(z_{t−1}^(b)) + σ_x · ε_xt^(b)
            Draw z_t^(b) = α_t · g_ϕ(x_t^(b)) + sqrt(1 − α_t − σ_z^2) · z_{t−1}^(b) + σ_z · ε_zt^(b)
        end
    end
    Compute loss L(θ, ϕ) in Eq. 12 using z_{0:T}^(1:B) and data samples from p(x_{1:T})
    Backpropagate to get ∇_θ L(θ, ϕ) and ∇_ϕ L(θ, ϕ)
    Update parameters θ and ϕ using stochastic optimization, e.g. Adam
end
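The alternating sampling recursion in Algorithm 1 can be sketched numerically. This is a minimal illustration, not the paper's implementation: the linear maps `W_f` and `W_g` below are hypothetical stand-ins for the learned attention-based networks f_θ (OTN) and g_ϕ (FTN), and the loss/backpropagation steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

Dx, Dz, T = 3, 2, 50                      # observation dim, latent dim, sequence length
sigma_x, sigma_z, alpha = 0.3, 0.1, 0.3   # noise scales and alternation parameter (values from the text)

# Hypothetical stand-ins for the learned transition networks.
W_f = 0.1 * rng.normal(size=(Dx, Dz))     # f_theta: latent -> observation
W_g = 0.1 * rng.normal(size=(Dz, Dx))     # g_phi:   observation -> latent

z = rng.normal(size=Dz)                   # z_0 ~ N(0, I_Dz)
xs, zs = [], []
for t in range(T):
    eps_x = rng.normal(size=Dx)
    eps_z = rng.normal(size=Dz)
    # Alternate: generate an observation from the latent, then update the latent.
    x = np.sqrt(1 - sigma_x**2) * (W_f @ z) + sigma_x * eps_x
    z = alpha * (W_g @ x) + np.sqrt(1 - alpha - sigma_z**2) * z + sigma_z * eps_z
    xs.append(x)
    zs.append(z)

xs, zs = np.stack(xs), np.stack(zs)
print(xs.shape, zs.shape)                 # trajectories of observations and latents
```

Note that the coefficients require σ_x^2 ≤ 1 and α_t + σ_z^2 ≤ 1 for the square roots to be real, which the tuned values above satisfy.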
Open Source Code | Yes | For comprehensive details regarding implementation specifics and hyperparameter configurations for each experiment, we refer the reader to Appendix B (code available at: https://github.com/vertaix/Alternators).
Open Datasets | Yes | We refer the reader to Glaser et al. (2020; 2018) for more details on how these data were collected. The SST dataset we consider here is the NOAA OISSTv2 dataset, which comprises daily weather images with high-resolution SST data from 1982 to 2021 (Huang et al., 2021).
Dataset Splits | Yes | We use the first 70% of each recording for training and the remaining 30% as the test set. We used data from 1982 to 2019 (15,048 data points) for training, data from the year 2020 (396 data points) for validation, and data from 2021 (396 data points) for testing.
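The chronological splits described above can be sketched as index slices. This is an assumed layout (one time-ordered array per dataset); the array names and the element counts simply follow the text.

```python
import numpy as np

# SST split: train on 1982-2019, validate on 2020, test on 2021.
n_train, n_val, n_test = 15048, 396, 396
series = np.arange(n_train + n_val + n_test)   # stand-in for the daily SST series

sst_train = series[:n_train]
sst_val   = series[n_train:n_train + n_val]
sst_test  = series[n_train + n_val:]

# Neural recordings: 70/30 chronological split per recording.
recording = np.arange(1000)                    # stand-in for one recording
cut = int(0.7 * len(recording))
rec_train, rec_test = recording[:cut], recording[cut:]

print(len(sst_train), len(sst_val), len(sst_test), len(rec_train), len(rec_test))
```

Splitting by time rather than at random avoids leakage from future observations into the training set, which matters for forecasting benchmarks.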
Hardware Specification | Yes | All SST experiments were conducted on NVIDIA A6000 GPUs with 48GB of memory, enabling efficient processing of the high-dimensional spatial-temporal inputs essential for accurate SST forecasting.
Software Dependencies | No | The paper mentions using the Adam optimizer and comparing against models like Mambas, neural ODEs, and diffusion models, but does not specify software versions for programming languages or libraries like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | For the Lorenz attractor experiments, we employed a 2-layer attention-based architecture for both the Observation Transition Network (OTN) and Feature Transition Network (FTN). Each attention layer was followed by a hidden layer containing 10 units, providing sufficient capacity to capture the complex chaotic dynamics while maintaining computational efficiency. The noise variance parameters were carefully tuned through grid search over σ_z, σ_x ∈ [0.01, 0.8]. We find the latent noise variance σ_z = 0.1 and observation noise variance σ_x = 0.3 to be the best choices. The alternation parameter α_t = 0.3 was kept fixed across all time steps to maintain consistent switching dynamics between the forward and backward processes. In this experiment, models were trained for 500 epochs using the Adam optimizer with an initial learning rate of 0.01. We applied a cosine annealing learning rate scheduler that reduced the learning rate to a minimum of 1 × 10^-4 over the training period, with 10 warm-up epochs to stabilize initial training dynamics.
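The learning-rate schedule described above can be sketched as a pure function of the epoch. The constants (0.01 peak, 1e-4 floor, 10 warm-up epochs, 500 total) come from the text; the linear shape of the warm-up is an assumption, since the paper only states its length.

```python
import math

LR_MAX, LR_MIN = 1e-2, 1e-4   # initial and minimum learning rates
WARMUP, EPOCHS = 10, 500      # warm-up epochs and total training epochs

def lr_at(epoch: int) -> float:
    """Learning rate at a given epoch: linear warm-up, then cosine annealing."""
    if epoch < WARMUP:
        # Assumed linear ramp from LR_MAX / WARMUP up to LR_MAX.
        return LR_MAX * (epoch + 1) / WARMUP
    # Cosine decay from LR_MAX down to LR_MIN over the remaining epochs.
    progress = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP), lr_at(EPOCHS - 1))
```

In PyTorch this is typically composed from `torch.optim.lr_scheduler.CosineAnnealingLR` plus a warm-up wrapper; the closed form above just makes the shape of the schedule explicit.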