Learning Tree-Structured Composition of Data Augmentation

Authors: Dongyue Li, Kailai Chen, Predrag Radivojac, Hongyang R. Zhang

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We validate the proposed algorithms on numerous graph and image data sets, including a multi-label graph classification data set we collected. The data set exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost (measured by GPU hours) by 43% over existing augmentation search methods while improving performance by 4.3%. Extensive experiments on contrastive learning also validate the benefit of our approach.
Researcher Affiliation Academia Dongyue Li, Kailai Chen, Predrag Radivojac, Hongyang R. Zhang (Northeastern University, Boston)
Pseudocode Yes We summarize the complete procedure in Algorithm 1.
Open Source Code Yes Our code for reproducing the experiments is available at https://github.com/Virtuoso-Research/Tree-dataaugmentation, which also includes instructions for loading the new dataset.
Open Datasets Yes We apply our algorithm to a newly collected graph classification data set generated using AlphaFold2 protein structure prediction APIs (Jumper et al., 2021)...Our code for reproducing the experiments is available at https://github.com/Virtuoso-Research/Tree-dataaugmentation, which also includes instructions for loading the new dataset. Next, we consider an image classification task using the iWildCam data set from the WILDS benchmark (Beery et al., 2021)... For contrastive learning, we consider image classification, including CIFAR-10 and a medical image data set...The sources are available online: Messidor, APTOS, and Jinchi. We also consider six graph classification data sets from TUDatasets (Morris et al., 2020), including NCI1, Proteins, DD, COLLAB, REDDIT, and IMDB
Dataset Splits Yes Table 4: We compare our algorithm with several existing data augmentation schemes on a protein graph classification data set (left) and a wildlife image classification data set (right). In particular, the left-hand side shows the average test AUROC scores for protein function prediction. The right shows the test macro F1 score on the image classification data set. We report the averaged results over five random seeds. Splits (protein / image): Training Set Size 12,302 / 6,568; Validation Set Size 4,100 / 426; Testing Set Size 4,102 / 789; # Classes 1,198 / 182. For graph contrastive learning, we report the 10-fold cross-validation results in Table 6.
Hardware Specification Yes For each algorithm, we report the runtime using an Nvidia RTX 6000 GPU.
Software Dependencies No The paper mentions software components like 'Python', 'PyTorch', etc., but does not list specific version numbers for any of these dependencies.
Experiment Setup Yes We use a three-layer graph neural network on graph data sets. We use a pretrained ResNet-50 on image data sets. In terms of hyperparameters, we search the maximum depth d up to 4 and H between [0, 1]. For weighted training, we adjust the learning rate η between 0.01, 0.1, 1.0 and the SGD steps α between 25, 50, 100. We train a randomly initialized Wide-ResNet-28-10 on all datasets using SGD with a learning rate of 0.03 and 100,000 gradient update steps, following Xie et al. (2020).
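The quoted search space can be summarized as a small grid. Below is a minimal sketch of that grid, not code from the released repository: the dictionary keys and the five-point discretization of H over [0, 1] are illustrative assumptions, while the value sets for d, η, and α come directly from the text above.

```python
from itertools import product

# Hyperparameter grid quoted in the experiment setup. Keys and the
# discretization of H are assumptions for illustration only.
search_space = {
    "max_depth_d": [1, 2, 3, 4],                  # depth d searched up to 4
    "threshold_H": [0.0, 0.25, 0.5, 0.75, 1.0],   # H varied within [0, 1]
    "learning_rate_eta": [0.01, 0.1, 1.0],        # weighted-training rate
    "sgd_steps_alpha": [25, 50, 100],             # SGD steps per round
}

def iter_configs(space):
    """Yield every combination in the grid as a config dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))  # 4 * 5 * 3 * 3 = 180 combinations
```

Enumerating the grid this way makes the search budget explicit, which is relevant to the paper's claimed 43% reduction in GPU hours relative to existing augmentation search methods.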