ALTA: Compiler-Based Analysis of Transformers
Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also propose tools to analyze cases where the expressibility of an algorithm is established, but end-to-end training on a given training set fails to induce behavior consistent with the desired algorithm. To this end, we explore training from ALTA execution traces as a more fine-grained supervision signal. This enables additional experiments and theoretical analyses relating the learnability of various algorithms to data availability and modeling decisions, such as positional encodings. We detail experiments and analysis on several tasks, with further details and results in Appendix D. |
| Researcher Affiliation | Industry | Peter Shaw1, James Cohan2, Jacob Eisenstein1, Kenton Lee1, Jonathan Berant1, Kristina Toutanova1 1Google DeepMind, 2Google |
| Pseudocode | Yes | Figure 2: Example ALTA Program. The parity program computes whether a given binary sequence contains an even or odd number of 1 tokens. For an input of length N, the parity variable of the final input element will equal the parity of the overall sequence after N + 1 layers, and computation will halt. The program specification contains all of the necessary information to compile the program to a Transformer. |
| Open Source Code | Yes | We make the ALTA framework language specification, symbolic interpreter, and weight compiler available to the community to enable further applications and insights.1 1Code is available at https://github.com/google-deepmind/alta. |
| Open Datasets | Yes | The SCAN (Lake & Baroni, 2018) suite of compositional generalization tasks requires mapping natural language commands (e.g., "jump twice") to action sequences (e.g., JUMP JUMP). |
| Dataset Splits | Yes | The train and test sets are the same for all experiments (including both trace and end-to-end supervision). The train set consists of examples between lengths 0 and 20, and the test set contains examples between lengths 0 and 40. The sets include roughly an equal number of examples per number of ones. |
| Hardware Specification | No | The paper mentions training models and memory constraints but does not specify any particular hardware components like CPU or GPU models used for the experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adam (Kingma & Ba, 2014) and Adafactor (Shazeer & Stern, 2018), activation functions like GELU (Hendrycks & Gimpel, 2016), and a tokenizer like SentencePiece (Kudo & Richardson, 2018). However, it does not provide specific version numbers for any of these software components or for core programming languages/libraries like Python or PyTorch. |
| Experiment Setup | Yes | Table 1: Trace supervision hyperparameters, listed as Sequential (Relative) / Sequential (Absolute) / Sum + Modulo: Hidden Layers 2 / 4 / 4; Hidden Layer Size 128 / 4,096 / 4,096; Batch Size 256 / 256 / 256; Steps 50,000 / 50,000 / 400,000; Learning Rate 1e-2 / 1e-4 / 1e-4; Activation Fn ReLU / ReLU / ReLU; Optimization Fn Adafactor / Adam / Adam; Noise Std Dev 0.1 / 0.1 / 0.1 |
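The parity program quoted above (Figure 2 of the paper) is described as propagating a per-position parity variable layer by layer until the final element holds the parity of the whole sequence. A minimal sketch of that layer-wise update, in plain Python rather than ALTA itself (the function name and the simplified update rule are illustrative assumptions, not the paper's compiled Transformer):

```python
def parity_trace(bits):
    """Illustrative sketch of a layer-wise parity computation:
    each position combines the previous position's parity with its
    own bit, so after roughly len(bits) layers the final position
    holds the parity of the entire sequence."""
    n = len(bits)
    parity = [0] * n  # per-position parity variable, initially 0
    for _ in range(n + 1):  # one pass per "layer"
        new = parity[:]
        for i in range(n):
            prev = parity[i - 1] if i > 0 else 0
            new[i] = (prev + bits[i]) % 2
        parity = new
    return parity[-1] if n else 0  # 1 if odd count of ones, else 0
```

After layer k, position i holds the parity of the last k bits ending at i, so after N (and any further) layers the final position holds the full-sequence parity, matching the halting behavior the caption describes.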