Transformers Struggle to Learn to Search

Authors: Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We demonstrate experimentally that transformers can indeed be taught to search, when given the right training distribution. We train transformer models, with the same architecture as GPT-2 (Radford et al., 2019) with ReLU activation."
Researcher Affiliation | Collaboration | Purdue University, New York University, Google, Boston University
Pseudocode | Yes | "Pseudocode for this procedure is shown in Algorithm 1." (Appendix A.3, Algorithm Reconstruction Pseudocode: "Algorithm 1: Pseudocode of the procedure to compute the set of explainable attention operations of a transformer M for a given input x.")
Open Source Code | Yes | "All code for generating data, training and evaluation is open-source and freely available at github.com/asaparov/learning_to_search."
Open Datasets | No | "In order to test whether transformers can learn to perform search, we need a way to produce a large number of search problems on which the model can be trained and evaluated. We do so by generating search problems on directed acyclic graphs (DAGs)."
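The generative process quoted above can be sketched as follows. This is a hypothetical illustration only: the function name, vertex count, and edge-probability parameter are assumptions for the sketch, not the paper's actual sampler. Edges are oriented from lower to higher vertex index, which guarantees acyclicity by construction.

```python
import random

def sample_dag_search_problem(num_vertices=8, edge_prob=0.3, rng=None):
    """Sample a random DAG and a (start, goal) pair connected by a path.

    Illustrative sketch of the kind of generative process described in
    the report; parameters and sampling details are assumptions.
    """
    rng = rng or random.Random(0)
    # Orient every edge from lower to higher index, so the graph is a DAG.
    edges = [(u, v) for u in range(num_vertices)
             for v in range(u + 1, num_vertices)
             if rng.random() < edge_prob]
    adj = {u: [] for u in range(num_vertices)}
    for u, v in edges:
        adj[u].append(v)

    def reachable_from(s):
        # Depth-first search collecting all vertices reachable from s.
        seen, stack = set(), [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    # Pick a start vertex with at least one reachable goal.
    for start in rng.sample(range(num_vertices), num_vertices):
        goals = reachable_from(start)
        if goals:
            goal = rng.choice(sorted(goals))
            return edges, start, goal
    return edges, 0, 0  # degenerate graph with no edges
```

Because the sampler only emits (start, goal) pairs that are actually connected, every generated problem has at least one valid answer path.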
Dataset Splits | Yes | "The first few batches are reserved for testing, and subsequent batches are filtered via exact matching to remove any examples that appear in the test set, to ensure that the examples in the test set are unseen. In all our experiments, we use the Sophia optimization algorithm (Liu et al., 2024) with a learning rate of 10^-5, weight decay of 0.1, and no dropout. We use a batch size of 1024 examples, unless otherwise stated. We minimize the cross-entropy loss during training. Some graphs contain multiple paths from the start to the goal vertex. During training, we select one path uniformly at random as the ground truth when computing the loss. During evaluation, we consider the model's prediction to be correct if the output vertex lies on any path to the goal."
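The evaluation rule quoted above (a prediction counts as correct if it lies on *any* start-to-goal path) can be checked without enumerating paths: a vertex v lies on some start→goal path iff v is reachable from the start and the goal is reachable from v. The following is a hedged sketch of that check; the function name is illustrative, not from the released code.

```python
def on_some_path(edges, start, goal, v):
    """Return True iff vertex v lies on at least one start->goal path.

    Sketch of the evaluation rule described above: v qualifies iff
    v is reachable from start AND goal is reachable from v.
    """
    adj = {}
    for u, w in edges:
        adj.setdefault(u, []).append(w)

    def reachable(src, dst):
        # Depth-first search from src, stopping early if dst is found.
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            if x == dst:
                return True
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    return reachable(start, v) and reachable(v, goal)
```

This decomposition avoids enumerating all paths, which can be exponential in the graph size, and runs in time linear in the number of edges per query.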
Hardware Specification | No | "This work was supported by Open Philanthropy, and in part through the NYU IT High Performance Computing resources, services, and staff expertise, and the Rosen Center for Advanced Computing at Purdue University."
Software Dependencies | No | "We train transformer models, with the same architecture as GPT-2 (Radford et al., 2019) with ReLU activation. We use the Sophia optimization algorithm (Liu et al., 2024) with a learning rate of 10^-5, weight decay of 0.1, and no dropout."
Experiment Setup | Yes | "We train transformer models with the same architecture as GPT-2 (Radford et al., 2019) with ReLU activation. In order to facilitate mechanistic interpretation of the trained model behavior, we use 1-hot embeddings for both the token and the absolute positional embedding. Furthermore, the token embedding and positional embeddings are concatenated rather than added to form the transformer input. We use 1 attention head per layer. The feed-forward dimension is the same as the model dimension. Since the edges in our inputs are randomly ordered, it would help for each token to be able to attend to any other token, rather than only the preceding tokens. As such, the model does not use a causal mask when computing attention. We train a 6-layer model with hidden dimension 16. To simulate training on samples of a very large corpus of data, we utilize streaming training. We continually sample batches from the generative process throughout training, instead of sampling batches from a fixed training set. The first few batches are reserved for testing, and subsequent batches are filtered via exact matching to remove any examples that appear in the test set, to ensure that the examples in the test set are unseen. In all our experiments, we use the Sophia optimization algorithm (Liu et al., 2024) with a learning rate of 10^-5, weight decay of 0.1, and no dropout. We use a batch size of 1024 examples, unless otherwise stated. We minimize the cross-entropy loss during training."
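The distinctive input construction described above (one-hot token and absolute-position embeddings that are concatenated rather than added) can be sketched as below. This is a minimal illustration under assumed names; the function names and argument names are not taken from the paper's released code.

```python
def one_hot(i, n):
    """A length-n one-hot vector with a 1.0 at index i."""
    v = [0.0] * n
    v[i] = 1.0
    return v

def build_input(tokens, vocab_size, max_len):
    """Build the transformer input for a token sequence.

    Sketch of the setup described in the report: each position's input
    is the one-hot token embedding concatenated with (not added to) the
    one-hot absolute positional embedding, giving vectors of dimension
    vocab_size + max_len.
    """
    return [one_hot(t, vocab_size) + one_hot(p, max_len)
            for p, t in enumerate(tokens)]
```

Concatenation keeps the token identity and the position in disjoint coordinates, which is what makes the attention operations easier to read off during mechanistic interpretation; adding the two one-hot vectors would mix them in the same coordinates.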