What Makes a Good Feedforward Computational Graph?

Authors: Alex Vitvitskyi, João Guilherme Madeira Araújo, Marc Lackenby, Petar Veličković

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type: Experimental
    Our study is backed both by theoretical analyses of the metrics' asymptotic behaviour for various graphs and by correlating these metrics with the performance of trained neural network models that use the corresponding graphs.

Researcher Affiliation: Collaboration
    1 Google DeepMind, 2 University of Oxford. Correspondence to: Alex Vitvitskyi <EMAIL>.

Pseudocode: Yes
    edges ← 1
    j ← i − 1
    while j > 0 and edges < budget do
        ρ ∼ U(0, 1)
        if ρ > p then
            E ← E ∪ {(j, i)}
            edges ← edges + 1
        end if
        j ← j − 1
    end while

Open Source Code: No
    The paper does not explicitly state that source code for the methodology is provided, nor does it include any links to a code repository.

Open Datasets: Yes
    "As an indication of the utility of various feedforward graphs in the setting where nodes correspond to natural language tokens, we also fine-tuned Gemma 2B (Team et al., 2024a), utilising these graphs as attention masks across all Transformer layers, on the standard Wikipedia dataset (https://www.tensorflow.org/datasets/catalog/wikipedia) containing texts obtained from Wikipedia database dumps."

Dataset Splits: No
    The paper mentions training on lengths up to 256 and testing on sequences up to 1,024 elements, and details training steps and batch sizes. However, it does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for any of the tasks or datasets used.

Hardware Specification: No
    The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.

Software Dependencies: No
    The paper mentions optimizers such as AdamW, LaProp (Ziyin et al., 2021), and RMSClip (Shazeer & Stern, 2018) and provides their hyperparameters, but it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, or TensorFlow versions).

Experiment Setup: Yes
    "For maximum tasks... We use the cross-entropy loss function and the AdamW optimiser with a 10⁻³ learning rate and a batch size of 256. Training is performed over 10,000 steps. For Parity... with only 8 layers and using standard multi-head attention with 8 heads; the vocabulary size was 2... and the embedding dimension was 256. To train the model we used the LaProp optimizer... for 1 million steps, with a batch size of 128 sequences. Our hyperparameters were: learning rate 1 × 10⁻³, β₁ = 0.9, β₂ = 0.9, weight decay 5 × 10⁻⁴, RMSClip s d 1."
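The edge-sampling pseudocode reported above can be sketched as runnable Python. This is a minimal sketch, not the authors' code: the function name `sample_feedforward_edges` is an assumption, and the edge collection here starts empty (the pseudocode initialises its counter at 1), while the loop structure mirrors the reported algorithm, walking candidate predecessors j = i − 1 down to 1 and accepting edge (j, i) with probability 1 − p until the budget is exhausted.

```python
import random

def sample_feedforward_edges(i, p, budget, rng=random):
    """Sketch of the reported sampling loop for node i: scan earlier
    nodes j = i-1, ..., 1 and keep edge (j, i) when a uniform draw
    exceeds p, stopping once `budget` edges have been collected."""
    edges = []          # assumption: start from an empty edge set
    j = i - 1
    while j > 0 and len(edges) < budget:
        rho = rng.random()          # rho ~ U(0, 1)
        if rho > p:                 # accept with probability 1 - p
            edges.append((j, i))
        j -= 1
    return edges
```

With p = 0 every candidate edge is (almost surely) kept, so the budget controls in-degree; with p close to 1 the sampled graph becomes sparse.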
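For the Gemma 2B fine-tuning described under Open Datasets, the feedforward graph is applied as an attention mask across all Transformer layers. A hedged illustration of that idea, assuming 0-indexed tokens and a hypothetical helper name (the paper does not release this code):

```python
import numpy as np

def edges_to_attention_mask(n, edges):
    """Hypothetical helper: convert a set of feedforward edges (j, i),
    with j < i meaning token i may attend to token j, into a boolean
    attention mask of shape (n, n). Self-attention is always allowed."""
    mask = np.eye(n, dtype=bool)    # assumption: keep the diagonal
    for j, i in edges:
        mask[i, j] = True           # query i may attend to key j
    return mask
```

Such a mask would then be broadcast over heads and batch when masking attention logits; that wiring detail is model-specific and not shown here.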