Scalable Hierarchical Self-Attention with Learnable Hierarchy for Long-Range Interactions

Authors: Thuan Nguyen Anh Trang, Khang Nhat Ngo, Hugo Sonnery, Thieu Vo, Siamak Ravanbakhsh, Truong Son Hy

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we report state-of-the-art performance on long-range graph benchmarks while remaining computationally efficient. Moving beyond graphs, we also show competitive performance on long-range sequence modeling, point cloud classification, and segmentation when using a fixed hierarchy. Our source code is publicly available at https://github.com/HySonLab/HierAttention.
Researcher Affiliation | Collaboration | Thuan Trang EMAIL FPT Software AI Center; Nhat Khang Ngo EMAIL FPT Software AI Center; Hugo Sonnery EMAIL McGill University, Mila - Quebec AI Institute; Thieu N. Vo EMAIL Ton Duc Thang University, FPT Software AI Center; Siamak Ravanbakhsh EMAIL McGill University, Mila - Quebec AI Institute; Truong Son Hy EMAIL Indiana State University, FPT Software AI Center
Pseudocode | Yes | Algorithm 1: Building Hierarchical Tree; Algorithm 2: Bottom-up Block; Algorithm 3: Top-down Block; Algorithm 4: Entire Network
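The paper's Algorithm 1 builds the hierarchy bottom-up by repeatedly clustering nodes into at most C groups for Λ levels. The sketch below illustrates that construction shape only; the paper's clustering is learnable, whereas here a trivial fixed chunking stands in for it, and the names `build_tree`, `num_levels`, and `max_clusters` are illustrative, not from the paper.

```python
# Hedged sketch of a bottom-up hierarchical tree (Algorithm 1's structure).
# Assumption: contiguous chunking replaces the paper's learned clustering.

def build_tree(leaf_feats, num_levels=3, max_clusters=32):
    """Return a list of levels: level 0 is the leaves, each higher level
    holds one centroid (mean of children) per cluster of the level below."""
    levels = [leaf_feats]
    for _ in range(num_levels):
        nodes = levels[-1]
        # Stand-in cluster assignment: split nodes into at most
        # `max_clusters` contiguous groups of roughly equal size.
        k = min(max_clusters, max(1, len(nodes) // 2))
        size = -(-len(nodes) // k)  # ceiling division
        parents = []
        for i in range(0, len(nodes), size):
            group = nodes[i:i + size]
            dim = len(group[0])
            centroid = [sum(v[d] for v in group) / len(group) for d in range(dim)]
            parents.append(centroid)
        levels.append(parents)
    return levels

levels = build_tree([[float(i)] for i in range(16)], num_levels=3, max_clusters=4)
print([len(level) for level in levels])  # node counts shrink toward the root
```

Each level halves or caps the node count, so attention across levels can mix local (leaf) and coarse (root) information at sub-quadratic cost.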
Open Source Code | Yes | Our source code is publicly available at https://github.com/HySonLab/HierAttention.
Open Datasets | Yes | To demonstrate the effectiveness of our hierarchical inductive bias, we apply Sequoia to three graph modeling datasets, namely LRGB (Dwivedi et al., 2022c), Polymer (St. John et al., 2019), and Citation network (Sen et al., 2008). We evaluate Sequoia on Long Range Arena (Tay et al., 2020b) (sequence classification on multiple input modalities). Shape classification: the ModelNet40 (Wu et al., 2015) dataset is a classification dataset that contains 12,311 3D models categorised into 40 classes. Part segmentation: the ShapeNetPart (Mo et al., 2019) dataset is a part segmentation dataset, which contains 16,881 synthetic point clouds from 16 classes.
Dataset Splits | No | The paper describes various datasets used for experiments but does not explicitly provide information on the training, validation, or test splits (e.g., percentages, sample counts, or specific predefined split references) within the main text.
Hardware Specification | Yes | For Tables 2, 3, 4, we reran all the models in the benchmark using the same framework, GraphGPS (Rampášek et al., 2022), on the same devices (RTX 3090) for a fair comparison in terms of runtime and number of parameters.
Software Dependencies | No | The paper mentions using GraphGPS and PyTorch Geometric for evaluation and comparison but does not provide specific version numbers for these or other software dependencies, which is required for a reproducible description of ancillary software.
Experiment Setup | Yes | The same configuration is shared across all datasets. In particular, we set batch size = 128, hidden dimension d = 96, tree layers Λ = 3, stochastic gradient descent optimizer with an initial learning rate of 10⁻⁴, maximum number of clusters C = 32, and epochs = 200. More specifically, our model consists of four transformer layers, with a hidden dimension of 512 split across settings for all datasets. We use a dropout rate of 0.1, and our model is optimized by the Adam optimizer with a learning rate of 0.001 with a linear warmup. In all experiments, our model uses four Transformer layers and a 4-layer tree. The other configurations are kept the same as in other works. In particular, the optimizer used in both experiments is SGD (learning rate = 0.05, momentum = 0.9, and weight decay = 0.0001), and we train the model for 200 epochs with a batch size of 32 for classification and 16 for segmentation.
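The quoted setup mixes settings from three experiment families (graph benchmarks, Long Range Arena, and point clouds). As a hedged reading, they can be collected into plain config dicts; the grouping and key names below are my interpretation of the quoted text, not an official config file from the repository.

```python
# Hedged summary of the reported hyperparameters; grouping into three
# experiment families and all key names are assumptions, not repo configs.

GRAPH_CONFIG = {           # graph benchmarks (shared across datasets)
    "batch_size": 128,
    "hidden_dim": 96,       # d
    "tree_layers": 3,       # Λ
    "max_clusters": 32,     # C
    "optimizer": "SGD",
    "learning_rate": 1e-4,
    "epochs": 200,
}

SEQUENCE_CONFIG = {        # Long Range Arena
    "transformer_layers": 4,
    "hidden_dim": 512,
    "dropout": 0.1,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "lr_schedule": "linear warmup",
}

POINT_CLOUD_CONFIG = {     # ModelNet40 classification / ShapeNetPart segmentation
    "transformer_layers": 4,
    "tree_layers": 4,
    "optimizer": "SGD",
    "learning_rate": 0.05,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "epochs": 200,
    "batch_size": {"classification": 32, "segmentation": 16},
}

print(len(GRAPH_CONFIG), len(SEQUENCE_CONFIG), len(POINT_CLOUD_CONFIG))
```

Note that the optimizer and epoch count differ per family (SGD at 1e-4 for graphs, Adam at 1e-3 for sequences, SGD at 0.05 for point clouds), which is exactly the kind of detail a reproduction would need to keep separate.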