Overcoming Order in Autoregressive Graph Generation for Molecule Generation
Authors: Edo Cohen-Karlik, Eyal Rozenberg, Daniel Freedman
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that sequential molecular graph generation models benefit from our proposed regularization scheme, especially when data is scarce. Our findings contribute to the growing body of research on graph generation and provide a valuable tool for various applications requiring the synthesis of realistic and diverse graph structures. ... In Section 5 we provide empirical evidence for the effectiveness of OLR. |
| Researcher Affiliation | Collaboration | Edo Cohen-Karlik (Verily Research; School of Computer Science, Tel Aviv University); Eyal Rozenberg (Verily Research); Daniel Freedman (Verily Research) |
| Pseudocode | No | The paper describes algorithms for generating DFS trajectories (Section 3.3, Appendix A) but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | Our empirical evaluation focuses on the former. We evaluate our proposed regularization method on the MOSES benchmark (Polykovskiy et al., 2020) and compare to relevant baselines. Our implementation is based on the Char RNN work, which uses three layers of the LSTM architecture, each with a hidden dimension of 600 (for complete details refer to Segler et al., 2018). We find a consistent improvement when adding OLR to the objective of autoregressive models. The data curated by Polykovskiy et al. (2020) is refined from the ZINC dataset (Sterling & Irwin, 2015), which contains approximately 4.6M molecules. |
| Dataset Splits | Yes | The authors provide partitions of the data into train, test and scaffold test to allow fair evaluation. ... After filtering we are left with approximately 500K molecules for training, and 55K for test and scaffold test partitions. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like "Long Short-Term Memory (LSTM) model", "Char RNN", "RDKit library", and "Chem Net" but does not specify their version numbers. |
| Experiment Setup | Yes | We employed a Long Short-Term Memory (LSTM) model with a hidden width of 100, trained as a regression task. During training, we used graphs containing 10 nodes and a training set consisting of 50 examples; our aim was to determine the effectiveness of OLR when data is extremely scarce. The network was trained until convergence with perfect training accuracy and evaluated on a test set consisting of 200 data points. ... Our implementation is based on the Char RNN work, which uses three layers of the LSTM architecture, each with a hidden dimension of 600 (for complete details refer to Segler et al., 2018). ... We use 1000 randomly sampled data points from the training set and evaluate over the entire test set. When training with small amounts of data, there is a tradeoff between the validity of the generated molecules and the uniqueness and other metrics. Our evaluation considers the best-performing models for each method, provided that the validity of the generated molecules exceeds 80%. |
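As noted above, the paper describes its DFS-trajectory generation only in prose (Section 3.3, Appendix A) rather than in pseudocode. A minimal illustrative sketch of producing one DFS node ordering over a small molecule-style graph might look like the following; the function name and the adjacency-list representation are our own assumptions, not the authors' code:

```python
# Hypothetical sketch of a DFS trajectory over a graph given as an
# adjacency list. Node labels stand in for atoms, edges for bonds.
# This is NOT the authors' implementation, only an illustration of
# the kind of ordering Section 3.3 discusses.

def dfs_trajectory(graph, start):
    """Return the node ordering visited by a depth-first traversal from `start`."""
    visited, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push neighbors in reverse so lower-indexed neighbors are expanded first.
        for nbr in reversed(graph[node]):
            if nbr not in visited:
                stack.append(nbr)
    return order

# Toy example: a propane-like chain 0-1-2 with a branch 1-3.
graph = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(dfs_trajectory(graph, 0))  # -> [0, 1, 2, 3]
print(dfs_trajectory(graph, 2))  # -> [2, 1, 0, 3]
```

Different start nodes (and neighbor orderings) yield different trajectories of the same graph, which is exactly the order ambiguity that the paper's OLR regularization is designed to address.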