ALTA: Compiler-Based Analysis of Transformers
Authors: Peter Shaw, James Cohan, Jacob Eisenstein, Kenton Lee, Jonathan Berant, Kristina Toutanova
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also propose tools to analyze cases where the expressibility of an algorithm is established, but end-to-end training on a given training set fails to induce behavior consistent with the desired algorithm. To this end, we explore training from ALTA execution traces as a more fine-grained supervision signal. This enables additional experiments and theoretical analyses relating the learnability of various algorithms to data availability and modeling decisions, such as positional encodings. We detail experiments and analysis on several tasks, with further details and results in Appendix D. |
| Researcher Affiliation | Industry | Peter Shaw1, James Cohan2, Jacob Eisenstein1, Kenton Lee1, Jonathan Berant1, Kristina Toutanova1 1Google DeepMind, 2Google |
| Pseudocode | Yes | Figure 2: Example ALTA Program. The parity program computes whether a given binary sequence contains an even or odd number of 1 tokens. For an input of length N, the parity variable of the final input element will equal the parity of the overall sequence after N + 1 layers, and computation will halt. The program specification contains all of the necessary information to compile the program to a Transformer. |
| Open Source Code | Yes | We make the ALTA framework language specification, symbolic interpreter, and weight compiler available to the community to enable further applications and insights.1 1Code is available at https://github.com/google-deepmind/alta. |
| Open Datasets | Yes | The SCAN (Lake & Baroni, 2018) suite of compositional generalization tasks requires mapping natural language commands (e.g., "jump twice") to action sequences (e.g., JUMP JUMP). |
| Dataset Splits | Yes | The train and test sets are the same for all experiments (including both trace and end-to-end supervision). The train set consists of examples between lengths 0 and 20, and the test set contains examples between lengths 0 and 40. The sets include roughly an equal number of examples per number of ones. |
| Hardware Specification | No | The paper mentions training models and memory constraints but does not specify any particular hardware components like CPU or GPU models used for the experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adam (Kingma & Ba, 2014) and Adafactor (Shazeer & Stern, 2018), activation functions like GELU (Hendrycks & Gimpel, 2016), and a tokenizer like SentencePiece (Kudo & Richardson, 2018). However, it does not provide specific version numbers for any of these software components or for core programming languages/libraries like Python or PyTorch. |
| Experiment Setup | Yes | Table 1: Trace supervision hyperparameters, listed as Sequential (Relative) / Sequential (Absolute) / Sum + Modulo: Hidden Layers 2 / 4 / 4; Hidden Layer Size 128 / 4,096 / 4,096; Batch Size 256 / 256 / 256; Steps 50,000 / 50,000 / 400,000; Learning Rate 1e-2 / 1e-4 / 1e-4; Activation Fn ReLU / ReLU / ReLU; Optimization Fn Adafactor / Adam / Adam; Noise Std Dev 0.1 / 0.1 / 0.1 |
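The parity program quoted above (Figure 2 of the paper) is described as propagating a per-position parity variable layer by layer until the final element holds the parity of the whole sequence. A minimal sketch of that layer-wise update, in plain Python rather than ALTA itself (the function name and the simplified update rule are illustrative assumptions, not the paper's compiled Transformer):

```python
def parity_trace(bits):
    """Illustrative sketch of a layer-wise parity computation:
    each position combines the previous position's parity with its
    own bit, so after roughly len(bits) layers the final position
    holds the parity of the entire sequence."""
    n = len(bits)
    parity = [0] * n  # per-position parity variable, initially 0
    for _ in range(n + 1):  # one pass per "layer"
        new = parity[:]
        for i in range(n):
            prev = parity[i - 1] if i > 0 else 0
            new[i] = (prev + bits[i]) % 2
        parity = new
    return parity[-1] if n else 0  # 1 if odd count of ones, else 0
```

After layer k, position i holds the parity of the last k bits ending at i, so after N (and any further) layers the final position holds the full-sequence parity, matching the halting behavior the caption describes.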