Scalable Hierarchical Self-Attention with Learnable Hierarchy for Long-Range Interactions
Authors: Thuan Nguyen Anh Trang, Khang Nhat Ngo, Hugo Sonnery, Thieu Vo, Siamak Ravanbakhsh, Truong Son Hy
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we report state-of-the-art performance on long-range graph benchmarks while remaining computationally efficient. Moving beyond graphs, we also display competitive performance on long-range sequence modeling, point-cloud classification, and segmentation when using a fixed hierarchy. Our source code is publicly available at https://github.com/HySonLab/HierAttention. |
| Researcher Affiliation | Collaboration | Thuan Trang EMAIL FPT Software AI Center; Nhat Khang Ngo EMAIL FPT Software AI Center; Hugo Sonnery EMAIL McGill University, Mila Quebec AI Institute; Thieu N. Vo EMAIL Ton Duc Thang University, FPT Software AI Center; Siamak Ravanbakhsh EMAIL McGill University, Mila Quebec AI Institute; Truong Son Hy EMAIL Indiana State University, FPT Software AI Center |
| Pseudocode | Yes | Algorithm 1: Building Hierarchical Tree Algorithm 2: Bottom-up Block Algorithm 3: Top-down Block Algorithm 4: Entire Network |
| Open Source Code | Yes | Our source code is publicly available at https://github.com/HySonLab/HierAttention. |
| Open Datasets | Yes | To demonstrate the effectiveness of our hierarchical inductive bias, we apply Sequoia to three graph modeling datasets, namely LRGB (Dwivedi et al., 2022c), Polymer (St. John et al., 2019), and Citation network (Sen et al., 2008). We evaluate Sequoia on Long Range Arena (Tay et al., 2020b) (sequence classification on multiple input modalities). Shape classification: ModelNet40 (Wu et al., 2015) is a classification dataset that contains 12,311 3D models categorised into 40 classes. Part segmentation: ShapeNetPart (Mo et al., 2019) is a part segmentation dataset, which contains 16,881 synthetic point clouds from 16 classes. |
| Dataset Splits | No | The paper describes various datasets used for experiments but does not explicitly provide information on the training, validation, or test splits (e.g., percentages, sample counts, or specific predefined split references) within the main text. |
| Hardware Specification | Yes | For Tables 2, 3, 4, we reran all the models in the benchmark using the same framework, GraphGPS (Rampášek et al., 2022), on the same devices (RTX 3090) for a fair comparison in terms of runtime and number of parameters. |
| Software Dependencies | No | The paper mentions using 'GraphGPS' and 'PyTorch Geometric' for evaluation and comparison but does not provide specific version numbers for these or other software dependencies, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | The same configuration is shared across all datasets. In particular, we set batch size = 128, hidden dimension d = 96, tree layers Λ = 3, a stochastic gradient descent optimizer with an initial learning rate of 10^-4, maximum number of clusters C = 32, and epochs = 200. More specifically, our model consists of four transformer layers, with a hidden dimension of 512 split across settings for all datasets. We use a dropout rate of 0.1, and our model is optimized by the Adam optimizer with a learning rate of 0.001 and a linear warmup. In all experiments, our model uses four Transformer layers and a 4-layer tree. The other configurations are kept the same as in other works. In particular, the optimizer used in both experiments is SGD (learning rate = 0.05, momentum = 0.9, and weight decay = 0.0001), and we train the model for 200 epochs with a batch size of 32 for classification and 16 for segmentation. |
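The quoted setup describes three distinct configurations (graph benchmarks, sequence modeling, and point clouds). A minimal sketch collecting them as plain dictionaries is shown below; the key names are illustrative assumptions, not identifiers from the authors' code.

```python
# Hypothetical config dicts summarizing the hyperparameters quoted above.
# Names are illustrative; the actual configuration format lives in the
# authors' repository (HySonLab/HierAttention).

graph_config = {
    "batch_size": 128,
    "hidden_dim": 96,       # d = 96
    "tree_layers": 3,       # Λ = 3
    "optimizer": "SGD",
    "lr": 1e-4,             # initial learning rate 10^-4
    "max_clusters": 32,     # C = 32
    "epochs": 200,
}

sequence_config = {
    "transformer_layers": 4,
    "hidden_dim": 512,      # split across settings per dataset
    "dropout": 0.1,
    "optimizer": "Adam",
    "lr": 1e-3,
    "warmup": "linear",
}

point_cloud_config = {
    "transformer_layers": 4,
    "tree_layers": 4,
    "optimizer": "SGD",
    "lr": 0.05,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "epochs": 200,
    "batch_size": {"classification": 32, "segmentation": 16},
}
```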