Hierarchical Graph Tokenization for Molecule-Language Alignment
Authors: Yongqiang Chen, Quanming Yao, Juzheng Zhang, James Cheng, Yatao Bian
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT in reducing hallucination by 40% and in significantly improving various molecule-language downstream tasks. |
| Researcher Affiliation | Academia | Yongqiang Chen¹²³, Quanming Yao⁴, Juzheng Zhang⁵, James Cheng³, Yatao Bian⁶. ¹MBZUAI, ²Carnegie Mellon University, ³The Chinese University of Hong Kong, ⁴Tsinghua University, ⁵University of Maryland, College Park, ⁶Department of Computer Science, National University of Singapore. Correspondence to: Yatao Bian <EMAIL>. |
| Pseudocode | No | The paper describes the HIGHT framework and its components in detail but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project is available at https://higraphllm.github.io/. |
| Open Datasets | Yes | We conduct extensive experiments to compare HIGHT with previous node-centric tokenization across 14 real-world tasks, including property prediction, molecular description, and chemical reaction prediction. The details and examples regarding the datasets and tasks involved in the experiments are given in Appendix C. We briefly introduce the setups below and leave the details in Appendix D. For classification, we consider three subtasks: HIV, BACE, and BBBP. The HIV subtask evaluates whether a molecule can impede the replication of the HIV virus. The BACE subtask evaluates the binding capability of a molecule to the BACE1 protein. The BBBP subtask evaluates the capability of a molecule to passively diffuse across the human blood-brain barrier. For task-specific instruction tuning, we convert these classification-based datasets into instructions; examples are given in Table 12. For regression, we adopt the instruction tuning data from Mol-Instructions (Fang et al., 2024). The regression-based property prediction focuses on predicting the quantum mechanics properties of the molecules. The 1D sequence information in this task is given by SELFIES (Krenn et al., 2019). The original data is sourced from the QM9 subset of MoleculeNet (Wu et al., 2017). We adopt three chemical-reaction-related tasks from Mol-Instructions (Fang et al., 2024): forward reaction prediction, reagent prediction, and retrosynthesis prediction. The input and output contain 1D sequence information given by SELFIES (Krenn et al., 2019). Some examples of the Mol-Instructions datasets are given in Table 14, where the SELFIES-represented molecules are denoted as SELFIES for clarity. The task of forward reaction prediction aims to predict the possible products of a chemical reaction: the input includes the SELFIES sequences of the reactant and reagent of the chemical reaction, and the model needs to predict the SELFIES of the products. The original data is sourced from USPTO, which consists of chemical reactions of organic molecules extracted from American patents and patent applications. |
| Dataset Splits | Yes | The authors collect 33,010 molecule-text pairs and split them into training (80%), validation (10%), and testing (10%) subsets. We mainly adopt the original training split to tune the model and evaluate the tuned model on the original test split. |
| Hardware Specification | Yes | We run experiments on Linux Servers with NVIDIA V100 and NVIDIA A100 (40G) graphics cards with CUDA 11.7. |
| Software Dependencies | Yes | We implement our methods with PyTorch 11.3 (Paszke et al., 2019). We run experiments on Linux Servers with NVIDIA V100 and NVIDIA A100 (40G) graphics cards with CUDA 11.7. |
| Experiment Setup | Yes | The GNN backbone is a 5-layer GIN (Xu et al., 2019) with a hidden dimension of 300. The adapter is a single-layer MLP. We consider base LLMs of vicuna-v1.3-7B (Chiang et al., 2023) for all the tasks and llama-2-7B-chat (Touvron et al., 2023b) for ablation studies. ... For the LoRA adapters, we use a LoRA rank of 128 and a scaling value α of 256 for molecular property prediction (classification) in order to better fit the task, and use a LoRA rank of 64 and a scaling value α of 16 for all the remaining methods and tasks. ... In stage 1 instruction tuning, we train all methods on the PubChem-295k dataset. The training runs for 5 epochs with a batch size of 64 (distributed across 4 GPUs) by default; if there is an OOM issue, we decrease the batch size to 40. The learning rate is set to 2×10⁻³ for all methods. |
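The 80/10/10 split reported in the Dataset Splits row can be sketched as below. This is a minimal illustration only: the review quotes the split ratios and the total of 33,010 pairs, but the authors' exact splitting procedure (random seed, stratification, scaffold vs. random) is not stated here, so the seeded random shuffle is an assumption.

```python
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle n indices and split into train/val/test subsets.

    The remaining (1 - train_frac - val_frac) fraction becomes the test set.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seed is a hypothetical choice
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 33,010 molecule-text pairs, as reported in the Dataset Splits row
train, val, test = split_indices(33010)
```

With 33,010 pairs this yields 26,408 training, 3,301 validation, and 3,301 test examples, and every index lands in exactly one subset.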