Intrinsic Dimension for Large-Scale Geometric Learning

Authors: Maximilian Stubbemann, Tom Hanika, Friedrich Martin Schneider

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type: Experimental. "In particular, we propose a principled way to incorporate neighborhood information, as in graph data, into the ID. This allows for new insights into common graph learning procedures, which we illustrate by experiments on the Open Graph Benchmark. We subsequently apply our method to seven real-world datasets and relate the obtained results to the observed performances of classification procedures. Thus, we demonstrate the practical computability of our approach. In addition, we study the extent to which the intrinsic dimension reveals insights into the performance of particular classes of Graph Neural Networks."
Researcher Affiliation: Academia. Maximilian Stubbemann (EMAIL), Knowledge & Data Engineering Group, University of Kassel, Kassel, Germany; Tom Hanika (EMAIL), Knowledge & Data Engineering Group, University of Kassel, Kassel, Germany; Friedrich Martin Schneider (EMAIL), Institute of Discrete Mathematics and Algebra, TU Bergakademie Freiberg, Freiberg, Germany.
Pseudocode: Yes.
Algorithm 1: Compute ∂(D) for a finite geometric dataset D = (X, µ, F).
Input: finite geometric dataset D = (X, µ, F). Output: ∂(D).
  forall f ∈ F do
    compute the feature sequence l_{f,D}
  forall k ∈ {2, ..., |X|} do
    forall f ∈ F do
      φ_{k,f}(D) = min_{j ∈ {0, ..., |X|−k}} (l_{f,D,k+j} − l_{f,D,1+j})
    ∂(D) += max_{f ∈ F} φ_{k,f}(D)
  ∂(D) = (1/|X|) ∂(D)
  return ∂(D)
...
Algorithm 2: Compute ∂_{s,−}(D), ∂_{s,+}(D), ∂(D) for a finite GD D = (X, µ, F).
Input: finite GD D = (X, µ, F), support sequence s = (2 = s_1, ..., s_l = |X|), exact (Boolean). Output: ∂_{s,−}(D), ∂_{s,+}(D), ∂(D).
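Algorithm 1 above can be sketched in a few lines of Python. The feature sequences l_{f,D} are assumed to be precomputed (their construction is dataset-specific), and the function name below is illustrative, not taken from the authors' code:

```python
import numpy as np

def intrinsic_dimension(feature_seqs):
    """Sketch of Algorithm 1: `feature_seqs` maps each feature f to its
    precomputed, non-decreasing feature sequence l_{f,D} of length |X|."""
    seqs = [np.asarray(l, dtype=float) for l in feature_seqs.values()]
    n = len(seqs[0])  # |X|
    total = 0.0
    for k in range(2, n + 1):
        # phi_{k,f}(D) = min over j in {0,...,n-k} of l_{f,D,k+j} - l_{f,D,1+j}
        # (0-based: l[k-1+j] - l[j]), computed vectorized per feature
        phis = [np.min(l[k - 1:] - l[:n - k + 1]) for l in seqs]
        total += max(phis)  # accumulate max_f phi_{k,f}(D)
    return total / n  # normalize by |X|, as in the pseudocode
```

For a single linearly growing feature sequence of length 4, the inner minimum for k = 2, 3, 4 is 1, 2, 3, so the result is (1 + 2 + 3) / 4 = 1.5.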
Open Source Code: Yes. "Our code is publicly available on GitHub: https://github.com/mstubbemann/ID4Geo"
Open Datasets: Yes. "This allows for new insights into common graph learning procedures, which we illustrate by experiments on the Open Graph Benchmark. ... The statistics for Cora, PubMed and CiteSeer were taken from PyTorch Geometric. The statistics of the OGB datasets were taken from the Open Graph Benchmark. ... PubMed, Cora and CiteSeer (Yang et al., 2016), which we retrieved from PyTorch Geometric (Fey & Lenssen, 2019). ... the well-known, large-scale ogbn-mag-papers100M dataset."
Dataset Splits: Yes. "For PubMed, Cora and CiteSeer, we train on the classification task provided by PyTorch Geometric (Fey & Lenssen, 2019) which was earlier studied by Yang et al. (2016). All Open Graph Benchmark datasets are trained and tested on the official node property prediction task (https://ogb.stanford.edu/docs/nodeprop/)."
Hardware Specification: Yes. "On our Xeon Gold system with 16 cores, approximating the ID of a k-hop geometric dataset built from ogbn-mag-papers100M is possible within a few hours."
Software Dependencies: No. "For all tasks, we use a simple SIGN model (Rossi et al., 2020) ... Implementation details and parameter choices can be found in Appendix A.1. ... For all models, we use an Adam optimizer with weight decay of 0.0001. ... We implement the MLE by using the NearestNeighbors class of scikit-learn (Pedregosa et al., 2011)."
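The MLE baseline referenced here is the Levina–Bickel maximum-likelihood estimator of intrinsic dimension. A minimal NumPy-only sketch is shown below; the paper's implementation uses scikit-learn's NearestNeighbors for the neighbor search, and the function name here is illustrative:

```python
import numpy as np

def mle_intrinsic_dimension(X, k=5):
    """Levina-Bickel MLE of intrinsic dimension (NumPy-only sketch;
    a brute-force pairwise search replaces sklearn's NearestNeighbors)."""
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # sort each row; drop column 0 (distance of each point to itself)
    knn = np.sort(dists, axis=1)[:, 1:k + 1]
    # per-point estimate: inverse of the mean log-ratio of the k-th
    # neighbor distance to the 1st, ..., (k-1)-th neighbor distances
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])
    return 1.0 / np.mean(np.sum(log_ratios, axis=1) / (k - 1))
```

On points sampled from a one-dimensional manifold, the estimate should come out close to 1; as with all k-NN estimators, it is biased for small k and degenerates if duplicate points produce zero distances.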
Experiment Setup: Yes. "For all tasks, we use a simple SIGN model (Rossi et al., 2020) with one hidden inception layer and one classification layer. For PubMed, CiteSeer and Cora, we use batch sizes of 256, a hidden layer size of 64 and dropout at the input and hidden layer with 0.5. The learning rate is set to 0.01. ... For ogbn-arxiv, we use a hidden dimension of 512, dropout at the input with 0.1 and with 0.5 at the hidden layer. For ogbn-mag, we use a hidden dimension of 512, no dropout at the input and dropout with 0.5 at the hidden layer. For ogbn-products, we use a hidden dimension of 512, input dropout of 0.3 and hidden layer dropout of 0.4. For all ogbn tasks, the learning rate is 0.001 and the batch size 50000. For all experiments, we train for a maximum of 1000 epochs with early stopping on the validation accuracy. Here, we use a patience of 15. ... For all models, we use an Adam optimizer with weight decay of 0.0001."
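The early-stopping rule described above (stop after 15 consecutive epochs without a validation-accuracy improvement, up to 1000 epochs) can be sketched as a small framework-independent helper; the class name is illustrative, not taken from the authors' code:

```python
class EarlyStopping:
    """Stop training once validation accuracy has not improved
    for `patience` consecutive epochs (the paper uses patience=15)."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        """Record one epoch's validation accuracy; return True to stop."""
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `if stopper.step(val_acc): break` after each epoch reproduces the described behavior.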