Learning Tree-Structured Composition of Data Augmentation

Authors: Dongyue Li, Kailai Chen, Predrag Radivojac, Hongyang R. Zhang

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We validate the proposed algorithms on numerous graph and image data sets, including a multi-label graph classification data set we collected. The data set exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost (measured by GPU hours) by 43% over existing augmentation search methods while improving performance by 4.3%. Extensive experiments on contrastive learning also validate the benefit of our approach.
Researcher Affiliation Academia Dongyue Li, Kailai Chen, Predrag Radivojac, Hongyang R. Zhang (Northeastern University, Boston)
Pseudocode Yes We summarize the complete procedure in Algorithm 1.
Open Source Code Yes Our code for reproducing the experiments is available at https://github.com/Virtuoso-Research/Tree-dataaugmentation, which also includes instructions for loading the new dataset.
Open Datasets Yes We apply our algorithm to a newly collected graph classification data set generated using AlphaFold2 protein structure prediction APIs (Jumper et al., 2021)...Our code for reproducing the experiments is available at https://github.com/Virtuoso-Research/Tree-dataaugmentation, which also includes instructions for loading the new dataset. Next, we consider an image classification task using the iWildCam data set from the WILDS benchmark (Beery et al., 2021)... For contrastive learning, we consider image classification, including CIFAR-10 and a medical image data set...The sources are available online: Messidor, APTOS, and Jinchi. We also consider six graph classification data sets from TUDatasets (Morris et al., 2020), including NCI1, Proteins, DD, COLLAB, REDDIT, and IMDB
Dataset Splits Yes Table 4: We compare our algorithm with several existing data augmentation schemes on a protein graph classification data set (left) and a wildlife image classification data set (right). In particular, the left-hand side shows the average test AUROC scores for protein function prediction. The right shows the test macro F1 score on the image classification data set. We report the averaged results over five random seeds. Splits (protein / image): Training Set Size 12,302 / 6,568; Validation Set Size 4,100 / 426; Testing Set Size 4,102 / 789; # Classes 1,198 / 182. For graph contrastive learning, we report the 10-fold cross-validation results in Table 6.
Hardware Specification Yes For each algorithm, we report the runtime using an Nvidia RTX 6000 GPU.
Software Dependencies No The paper mentions software components like 'Python', 'PyTorch', etc., but does not list specific version numbers for any of these dependencies.
Experiment Setup Yes We use a three-layer graph neural network on graph data sets. We use a pretrained ResNet-50 on image data sets. In terms of hyperparameters, we search the maximum depth d up to 4 and H between [0, 1]. For weighted training, we adjust the learning rate η between 0.01, 0.1, 1.0 and the SGD steps α between 25, 50, 100. We train a randomly initialized Wide-ResNet-28-10 on all datasets using SGD with a learning rate of 0.03 and 100,000 gradient update steps, following Xie et al. (2020).
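The quoted search space can be summarized as a small grid. Below is a minimal sketch of that grid, not code from the released repository: the dictionary keys and the five-point discretization of H over [0, 1] are illustrative assumptions, while the value sets for d, η, and α come directly from the text above.

```python
from itertools import product

# Hyperparameter grid quoted in the experiment setup. Keys and the
# discretization of H are assumptions for illustration only.
search_space = {
    "max_depth_d": [1, 2, 3, 4],                  # depth d searched up to 4
    "threshold_H": [0.0, 0.25, 0.5, 0.75, 1.0],   # H varied within [0, 1]
    "learning_rate_eta": [0.01, 0.1, 1.0],        # weighted-training rate
    "sgd_steps_alpha": [25, 50, 100],             # SGD steps per round
}

def iter_configs(space):
    """Yield every combination in the grid as a config dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))  # 4 * 5 * 3 * 3 = 180 combinations
```

Enumerating the grid this way makes the search budget explicit, which is relevant to the paper's claimed 43% reduction in GPU hours relative to existing augmentation search methods.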