Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning

Authors: Yihe Deng, Ruochi Zhang, Pan Xu, Jian Ma, Quanquan Gu

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We demonstrate the effectiveness of PhyGCN through various evaluations across multiple tasks and datasets, showing its advantage over state-of-the-art hypergraph learning methods. Notably, PhyGCN has been applied to study multi-way chromatin interactions and polypharmacy side-effect network data, confirming its advantages in producing enhanced node embeddings and modeling higher-order interactions."
Researcher Affiliation: Academia. Yihe Deng (EMAIL), Department of Computer Science, University of California, Los Angeles; Ruochi Zhang (EMAIL), Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard; Pan Xu (EMAIL), Department of Biostatistics and Bioinformatics, Duke University; Jian Ma (EMAIL), Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University; Quanquan Gu (EMAIL), Department of Computer Science, University of California, Los Angeles
Pseudocode: No. The paper describes methods textually and mathematically in Sections 3, 3.1, 3.2, and 3.3, and uses figures (Figures 1 and 2) to illustrate workflows and architectures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code: No. The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets: Yes. "We evaluated these methods using the benchmark node classification task on citation networks (including Citeseer, Cora, DBLP, and PubMed) whose hyperedges denote either co-authorship or co-citation relationships. [...] we use the SPRITE data (Quinodoz et al., 2018) from the GM12878 lymphoblastoid human cell line following the same processing procedure in MATCHA (Zhang & Ma, 2020). [...] We use the polypharmacy side-effect dataset (Zitnik et al., 2018)"
Dataset Splits: Yes. "To evaluate PhyGCN against the baselines on a wider range of training data ratios, we generated data splits with increasing training data ratios of 1%, 2%, 4%, 8% and 16%. [...] Specifically, we conduct four experiments on each dataset for each model, incrementally increasing the training data ratio from 0.5% to 16%. [...] In Fig. 4a, a split by chromosome indices is performed, where nodes from odd chromosome indices constitute the training data, and those from even indices form the test data. For cross-validation, the training and test sets are swapped, and the average accuracy of each method is reported."
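The two split procedures described above (increasing training ratios, and the odd/even-chromosome swap cross-validation) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names and the `evaluate` callback are hypothetical.

```python
import numpy as np

def ratio_splits(num_nodes, ratios=(0.01, 0.02, 0.04, 0.08, 0.16), seed=0):
    """Random node-index train/test splits at increasing training ratios."""
    rng = np.random.default_rng(seed)
    splits = {}
    for r in ratios:
        perm = rng.permutation(num_nodes)
        n_train = max(1, int(r * num_nodes))
        splits[r] = (perm[:n_train], perm[n_train:])
    return splits

def chromosome_swap_cv(node_chrom, evaluate):
    """Two-fold CV: train on odd-chromosome nodes, test on even, then swap.

    `node_chrom[i]` is the chromosome index of node i; `evaluate` is a
    hypothetical callback that trains a model and returns test accuracy.
    """
    node_chrom = np.asarray(node_chrom)
    odd = np.flatnonzero(node_chrom % 2 == 1)
    even = np.flatnonzero(node_chrom % 2 == 0)
    acc_1 = evaluate(train_idx=odd, test_idx=even)
    acc_2 = evaluate(train_idx=even, test_idx=odd)
    return (acc_1 + acc_2) / 2.0  # report the average accuracy
```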
Hardware Specification: No. The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies: No. The paper mentions optimizers like Adam and SGD, and refers to the 'Cooler' software for calculating A/B compartment scores, but does not provide specific version numbers for these or any other key software libraries or programming languages used in the experiments.
Experiment Setup: Yes. "For PhyGCN, we consider three hypergraph convolutional layers, each with a uniform hidden layer size of 128 and an output embedding size of 64. In the multi-head attention layer used for pre-training, we set the number of heads to 8, aligning with the setting in Zhang et al. (2020). The number of training epochs is set to 300, and each training batch of the pre-training task contains 32 positive samples, while we generate 5 negative samples for each positive sample. We use the Adam optimizer with learning rate 0.001 and weight decay 5e-4. The dropout rate is set to 0.5, and the coefficient for DropEdge is set to 0.3: masking 30% of the values of the adjacency matrix Â at each training iteration."
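The reported hyperparameters (three convolutional layers, hidden size 128, embedding size 64, dropout 0.5, DropEdge 0.3, Adam with learning rate 0.001 and weight decay 5e-4) can be collected into a minimal PyTorch sketch. The layer definition below is a hypothetical simplification, not the paper's actual hypergraph convolution or pre-training attention module; all class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConvLayer(nn.Module):
    """Simplified convolution: X' = ReLU(Â X W). Stand-in for the paper's layer."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):
        return F.relu(self.lin(a_hat @ x))

class PhyGCNSketch(nn.Module):
    """Three layers, hidden size 128, embedding size 64, dropout 0.5, DropEdge 0.3."""
    def __init__(self, in_dim, hidden=128, out=64, dropedge=0.3, dropout=0.5):
        super().__init__()
        self.layers = nn.ModuleList([
            HyperConvLayer(in_dim, hidden),
            HyperConvLayer(hidden, hidden),
            HyperConvLayer(hidden, out),
        ])
        self.dropedge = dropedge
        self.dropout = dropout

    def forward(self, x, a_hat):
        if self.training and self.dropedge > 0:
            # DropEdge: mask 30% of the entries of Â at each training iteration
            keep = (torch.rand_like(a_hat) >= self.dropedge).float()
            a_hat = a_hat * keep
        for i, layer in enumerate(self.layers):
            x = layer(x, a_hat)
            if i < len(self.layers) - 1:
                x = F.dropout(x, p=self.dropout, training=self.training)
        return x

model = PhyGCNSketch(in_dim=16)
# Adam with the reported learning rate and weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
```

Training would then run for 300 epochs, with each pre-training batch pairing 32 positive samples with 5 negative samples per positive; that sampling loop is omitted here.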