Transformers Struggle to Learn to Search

Authors: Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We demonstrate experimentally that transformers can indeed be taught to search, when given the right training distribution. We train transformer models, with the same architecture as GPT-2 (Radford et al., 2019) with ReLU activation."
Researcher Affiliation | Collaboration | Purdue University, New York University, Google, Boston University
Pseudocode | Yes | "Pseudocode for this procedure is shown in Algorithm 1." (Appendix A.3, Algorithm Reconstruction Pseudocode: "Algorithm 1: Pseudocode of the procedure to compute the set of explainable attention operations of a transformer M for a given input x.")
Open Source Code | Yes | "All code for generating data, training and evaluation is open-source and freely available at github.com/asaparov/learning_to_search."
Open Datasets | No | "In order to test whether transformers can learn to perform search, we need a way to produce a large number of search problems on which the model can be trained and evaluated. We do so by generating search problems on directed acyclic graphs (DAGs)."
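The generative process quoted above can be sketched as follows. This is a hypothetical illustration only: the function name, vertex count, and edge-probability parameter are assumptions for the sketch, not the paper's actual sampler. Edges are oriented from lower to higher vertex index, which guarantees acyclicity by construction.

```python
import random

def sample_dag_search_problem(num_vertices=8, edge_prob=0.3, rng=None):
    """Sample a random DAG and a (start, goal) pair connected by a path.

    Illustrative sketch of the kind of generative process described in
    the report; parameters and sampling details are assumptions.
    """
    rng = rng or random.Random(0)
    # Orient every edge from lower to higher index, so the graph is a DAG.
    edges = [(u, v) for u in range(num_vertices)
             for v in range(u + 1, num_vertices)
             if rng.random() < edge_prob]
    adj = {u: [] for u in range(num_vertices)}
    for u, v in edges:
        adj[u].append(v)

    def reachable_from(s):
        # Depth-first search collecting all vertices reachable from s.
        seen, stack = set(), [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    # Pick a start vertex with at least one reachable goal.
    for start in rng.sample(range(num_vertices), num_vertices):
        goals = reachable_from(start)
        if goals:
            goal = rng.choice(sorted(goals))
            return edges, start, goal
    return edges, 0, 0  # degenerate graph with no edges
```

Because the sampler only emits (start, goal) pairs that are actually connected, every generated problem has at least one valid answer path.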
Dataset Splits | Yes | "The first few batches are reserved for testing, and subsequent batches are filtered via exact matching to remove any examples that appear in the test set, to ensure that the examples in the test set are unseen. In all our experiments, we use the Sophia optimization algorithm (Liu et al., 2024) with a learning rate of 10^-5, weight decay of 0.1, and no dropout. We use a batch size of 1024 examples, unless otherwise stated. We minimize the cross-entropy loss during training. Some graphs contain multiple paths from the start to the goal vertex. During training, we select one path uniformly at random as the ground truth when computing the loss. During evaluation, we consider the model's prediction to be correct if the output vertex lies on any path to the goal."
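The evaluation rule quoted above (a prediction counts as correct if it lies on *any* start-to-goal path) can be checked without enumerating paths: a vertex v lies on some start→goal path iff v is reachable from the start and the goal is reachable from v. The following is a hedged sketch of that check; the function name is illustrative, not from the released code.

```python
def on_some_path(edges, start, goal, v):
    """Return True iff vertex v lies on at least one start->goal path.

    Sketch of the evaluation rule described above: v qualifies iff
    v is reachable from start AND goal is reachable from v.
    """
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)

    def reachable(src, dst):
        # Depth-first search from src, stopping early if dst is found.
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            if x == dst:
                return True
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    return reachable(start, v) and reachable(v, goal)
```

This decomposition avoids enumerating all paths, which can be exponential in the graph size, and runs in time linear in the number of edges per query.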
Hardware Specification | No | "This work was supported by Open Philanthropy, and in part through the NYU IT High Performance Computing resources, services, and staff expertise, and the Rosen Center for Advanced Computing at Purdue University."
Software Dependencies | No | "We train transformer models, with the same architecture as GPT-2 (Radford et al., 2019) with ReLU activation. We use the Sophia optimization algorithm (Liu et al., 2024) with a learning rate of 10^-5, weight decay of 0.1, and no dropout."
Experiment Setup | Yes | "We train transformer models with the same architecture as GPT-2 (Radford et al., 2019) with ReLU activation. In order to facilitate mechanistic interpretation of the trained model behavior, we use 1-hot embeddings for both the token and the absolute positional embedding. Furthermore, the token embedding and positional embeddings are concatenated rather than added to form the transformer input. We use 1 attention head per layer. The feed-forward dimension is the same as the model dimension. Since the edges in our inputs are randomly ordered, it would help for each token to be able to attend to any other token, rather than only the preceding tokens. As such, the model does not use a causal mask when computing attention. We train a 6-layer model with hidden dimension 16. To simulate training on samples of a very large corpus of data, we utilize streaming training. We continually sample batches from the generative process throughout training, instead of sampling batches from a fixed training set. The first few batches are reserved for testing, and subsequent batches are filtered via exact matching to remove any examples that appear in the test set, to ensure that the examples in the test set are unseen. In all our experiments, we use the Sophia optimization algorithm (Liu et al., 2024) with a learning rate of 10^-5, weight decay of 0.1, and no dropout. We use a batch size of 1024 examples, unless otherwise stated. We minimize the cross-entropy loss during training."
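The distinctive input construction described above (one-hot token and absolute-position embeddings that are concatenated rather than added) can be sketched as below. This is a minimal illustration under assumed names; the function names and argument names are not taken from the paper's released code.

```python
def one_hot(i, n):
    """A length-n one-hot vector with a 1.0 at index i."""
    v = [0.0] * n
    v[i] = 1.0
    return v

def build_input(tokens, vocab_size, max_len):
    """Build the transformer input for a token sequence.

    Sketch of the setup described in the report: each position's input
    is the one-hot token embedding concatenated with (not added to) the
    one-hot absolute positional embedding, giving vectors of dimension
    vocab_size + max_len.
    """
    return [one_hot(t, vocab_size) + one_hot(p, max_len)
            for p, t in enumerate(tokens)]
```

Concatenation keeps the token identity and the position in disjoint coordinates, which is what makes the attention operations easier to read off during mechanistic interpretation; adding the two one-hot vectors would mix them in the same coordinates.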