Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks

Authors: Sadegh Mahdavi, Kevin Swersky, Thomas Kipf, Milad Hashemi, Christos Thrampoulidis, Renjie Liao

TMLR 2023

Reproducibility assessment (each entry gives the variable, the assessed result, and the supporting LLM response):
Research Type: Experimental
LLM Response: First, we argue that OOD generalization in this setting is significantly different than common OOD settings. For example, some phenomena in OOD generalization of image classifications such as accuracy on the line are not observed here, and techniques such as data augmentation methods do not help as assumptions underlying many augmentation techniques are often violated. Second, we analyze the main challenges (e.g., input distribution shift, non-representative data generation, and uninformative validation metrics) of the current leading benchmark, i.e., CLRS (Veličković et al., 2021), which contains 30 algorithmic reasoning tasks. We propose several solutions, including a simple-yet-effective fix to the input distribution shift and improved data generation. Finally, we propose an attention-based 2WL-graph neural network (GNN) processor which complements message-passing GNNs so their combination outperforms the state-of-the-art model by a 3% margin averaged over all algorithms. Our code is available at: https://github.com/smahdavi4/clrs.
Researcher Affiliation: Collaboration
LLM Response: Sadegh Mahdavi (University of British Columbia); Kevin Swersky (Google Research, Brain Team); Thomas Kipf (Google Research, Brain Team); Milad Hashemi (Google Research, Brain Team); Christos Thrampoulidis (University of British Columbia); Renjie Liao (University of British Columbia)
Pseudocode: No
LLM Response: The paper describes the 'Encode-Process-Decode' framework and provides a simplified formula (Eq. 1) for the processor: H^(t+1) = GNN([H^(t), Z_inp], A). It explains the process for Breadth-first Search (BFS) in detail. However, it does not contain a distinct, clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps in a code-like format.
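The recurrence in Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `gnn_step` and `decode` are hypothetical stand-ins for the processor and decoder networks, and the zero initialization of the hidden states is an assumption.

```python
import numpy as np

def encode_process_decode(z_inp, adj, gnn_step, decode, num_steps=32):
    """Roll out the processor recurrence H^(t+1) = GNN([H^(t), Z_inp], A).

    z_inp:    encoded inputs Z_inp, shape (num_nodes, d)
    adj:      adjacency matrix A, shape (num_nodes, num_nodes)
    gnn_step: callable (node_features, adj) -> hidden states (num_nodes, d)
    decode:   callable mapping final hidden states to task outputs
    """
    h = np.zeros_like(z_inp)  # H^(0): zero-initialized hidden states (assumption)
    for _ in range(num_steps):
        # Concatenate current hidden states with the encoded inputs, as in Eq. (1)
        h = gnn_step(np.concatenate([h, z_inp], axis=-1), adj)
    return decode(h)
```

The default of 32 recurrent steps mirrors the no-hint setting described in the experiment setup below, where the number of processing steps is fixed to 32.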
Open Source Code: Yes
LLM Response: Our code is available at: https://github.com/smahdavi4/clrs.
Open Datasets: Yes
LLM Response: Second, we analyze the main challenges (e.g., input distribution shift, non-representative data generation, and uninformative validation metrics) of the current leading benchmark, i.e., CLRS (Veličković et al., 2021), which contains 30 algorithmic reasoning tasks.
Dataset Splits: Yes
LLM Response: Table 3: Summary of dataset configurations used in this paper.

Dataset                          Lengths (Train/Val)  Lengths (Test)  Size (Train)  Size (Val/Test)  Dataset Generation
CLRS (Veličković et al., 2021)   16                   64              1000          32               Erdős–Rényi graphs with fixed p
L-CLRS                           16                   64              100 000       32               Erdős–Rényi graphs with fixed p
L-CLRS-Len                       16                   32              100 000       1000             K-regular graphs with K = 4
L-CLRS-Deg                       32                   32              100 000       1000             K-regular, K = 4 (train/val), K = 8 (test)
L-CLRS-Len-Deg                   16                   32              100 000       1000             K-regular, K = 4 (train/val), K = 8 (test)
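The fixed-p Erdős–Rényi scheme in the table above can be sketched as follows. The function name and interface are illustrative, not the benchmark's; for the K-regular variants, a library routine such as networkx's `random_regular_graph` could be used instead.

```python
import random

def erdos_renyi_adjacency(num_nodes, p, seed=0):
    """Sample an undirected Erdos-Renyi graph G(n, p) as a 0/1 adjacency
    matrix: each unordered node pair is connected independently with
    probability p (the fixed-p scheme of the CLRS / L-CLRS splits)."""
    rng = random.Random(seed)
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1  # undirected: set both directions
    return adj
```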
Hardware Specification: No
LLM Response: Acknowledgments: This work was funded, in part, by NSERC DG Grants (No. RGPIN-2022-04636 and No. RGPIN-2021-03677), Google Cloud credits, and Oracle Cloud credits. Resources used in preparing this research were provided, in part, by Google, Advanced Research Computing at the University of British Columbia, the Oracle for Research program, and the Digital Research Alliance of Canada (alliancecan.ca). The paper mentions using 'Google Cloud credits', 'Oracle Cloud credits', and other computing resources, but it does not specify any particular GPU models, CPU types, or memory configurations used for the experiments.
Software Dependencies: No
LLM Response: For the rest of the experiments, we use a batch size of 32 for message-passing processors and 16 for high-order processors, gradient clipping to a norm of 1.0, a learning rate of 0.0001, 20 000 optimization steps, and a cosine annealing scheduler. The paper describes various hyperparameters and training strategies but does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup: Yes
LLM Response: For the rest of the experiments, we use a batch size of 32 for message-passing processors and 16 for high-order processors, gradient clipping to a norm of 1.0, a learning rate of 0.0001, 20 000 optimization steps, and a cosine annealing scheduler. Moreover, as the number of processing steps is meaningless in the no-hint version, we set this number to 32 (i.e., 32 recurrent steps of message passing).
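The reported schedule can be written down concretely. This is a minimal sketch of cosine annealing over the stated 20 000 steps, assuming the rate decays from the reported 0.0001 down to zero; the paper names only "cosine annealing scheduler", so the endpoint is an assumption.

```python
import math

def cosine_annealed_lr(step, total_steps=20_000, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` under cosine annealing: decays smoothly
    from base_lr at step 0 to min_lr at total_steps."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine
```

In a training loop, each optimization step would also clip gradients to a global norm of 1.0 before the optimizer update, per the setup above.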