Learning Randomized Algorithms with Transformers

Authors: Johannes von Oswald, Seijin Kobayashi, Yassir Akram, Angelika Steger

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural network's computation and predictions.
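The core idea quoted above, providing randomness as an input the network can learn to use, can be illustrated with a minimal NumPy sketch. This is not the paper's transformer architecture: the toy two-layer network, its dimensions, and the untrained weights are all illustrative assumptions; it only shows how a fresh noise vector per forward pass makes the prediction itself a random variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "randomized network": a fixed input x is concatenated with a fresh
# noise vector z on every forward pass, so the prediction is random.
# (Illustrative sketch only; the paper studies learned transformers.)
W1 = 0.5 * rng.normal(size=(8, 16))  # first layer: (4 input + 4 noise) -> 16
W2 = 0.5 * rng.normal(size=(16, 3))  # second layer: 16 -> 3 class logits

def randomized_forward(x, rng):
    z = rng.normal(size=4)                    # randomness provided to the model
    h = np.tanh(np.concatenate([x, z]) @ W1)  # the network may use or ignore z
    return int(np.argmax(h @ W2))             # prediction varies with z

x = np.array([1.0, -0.5, 0.3, 0.0])
preds = [randomized_forward(x, rng) for _ in range(100)]
print(set(preds))  # distinct classes can appear for the same fixed input
```

Training (by gradient descent or evolutionary search, as the paper describes) then shapes how the network exploits z, e.g. to randomize its choices against an oblivious adversary.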
Researcher Affiliation | Collaboration | Johannes von Oswald 1, Seijin Kobayashi 1,2, Yassir Akram 2, Angelika Steger 2. 1 Paradigms of Intelligence Team, Google; 2 Department of Computer Science, ETH Zürich.
Pseudocode | No | The paper describes theoretical considerations and specific algorithmic steps in paragraph form, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures with structured pseudocode.
Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available or released.
Open Datasets | No | The paper mentions 'Randomly generated binary value vectors' for associative recall, 'Random permutation of graph indices' for graph coloring, and describes the setup for grid worlds. It does not provide concrete access information (links, DOIs, specific citations with author/year, or established benchmark dataset names) for publicly available datasets used in its experiments.
Dataset Splits | No | The paper describes how the data is generated or structured for each task (e.g., 'Context size Variable size from 8-20', 'N = 10 vertices', '5x5 grid'). However, it does not explicitly provide dataset splits (e.g., percentages or counts for training, validation, and test sets) for reproducibility.
Hardware Specification | Yes | We provide here additional results and training details to reproduce the experiments one by one, as well as a proof of Proposition 1. We first give a short overview of the compute budget used for this project. For the associative recall task as well as the graph coloring problem, we estimate a total compute budget of four Nvidia RTX 4090 GPUs for a month. For the grid world problem, we use a cluster of 16 A100 GPUs, which we used to scan over the hyperparameters on and off over a total of two weeks.
Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2015) and Parameter-exploring Policy Gradients (PEPG) (Sehnke et al., 2010), which are algorithms/methods. However, it does not list any specific software libraries or programming languages with version numbers (e.g., Python 3.8, PyTorch 1.9) required for replication.
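For readers unfamiliar with PEPG, the evolutionary strategy cited above, here is a minimal symmetric-sampling sketch in the spirit of Sehnke et al. (2010). The toy quadratic reward, the learning rates, and the clipping bounds are all illustrative assumptions, not values from the paper; a faithful replication would follow the original PEPG formulation and the paper's hyperparameter tables.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])

def reward(theta):
    # Toy objective: maximized (value 0) when theta equals target.
    return -np.sum((theta - target) ** 2)

mu = np.zeros(3)       # mean of the Gaussian search distribution
sigma = np.ones(3)     # per-parameter exploration std deviations
lr_mu, lr_sigma = 0.05, 0.01
baseline = reward(mu)  # moving reward baseline

for _ in range(300):
    eps = rng.normal(size=3) * sigma
    r_plus, r_minus = reward(mu + eps), reward(mu - eps)
    r_mean = (r_plus + r_minus) / 2
    # Symmetric-sampling updates: mu moves along eps weighted by the reward
    # difference; sigma grows where exploring beyond it paid off.
    mu += lr_mu * (r_plus - r_minus) / 2 * eps
    sigma += lr_sigma * (r_mean - baseline) * (eps ** 2 - sigma ** 2) / sigma
    sigma = np.clip(sigma, 0.05, 1.0)         # keep stds positive and bounded
    baseline = 0.9 * baseline + 0.1 * r_mean  # exponential moving baseline

print(mu)  # should have drifted from the origin toward target
```

Because PEPG needs only reward evaluations, not gradients, it can train the randomized networks above even when sampling makes the loss non-differentiable.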
Experiment Setup | Yes | Table 1: Hyperparameters for the associative recall task. Table 2: Hyperparameters for the graph coloring task. Table 3: Hyperparameters for the grid world task.