Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling
Authors: Shuqi Lu, Xiaohong Ji, Bohang Zhang, Lin Yao, Siyuan Liu, Zhifeng Gao, Linfeng Zhang, Guolin Ke
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models. [...] We conduct extensive experiments to evaluate the effectiveness and efficiency of SpaceFormer. Across a total of 15 diverse downstream tasks, SpaceFormer achieves the best performance on 10 tasks and ranks within the top 2 on 14 tasks. Ablation studies further confirm that each component plays a critical role in enhancing either the performance or efficiency of SpaceFormer. |
| Researcher Affiliation | Collaboration | 1DP Technology, Beijing, China 2Peking University, Beijing, China. Correspondence to: Guolin Ke <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and components in detail, including 'grid-based space discretization', 'grid sampling/merging', and 'efficient 3D positional encoding', but it does not present these as structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Pretraining settings. We use the same pretraining dataset as Zhou et al. (2023), which contains a total of 19 million molecules. [...] For computational properties, we sample a 20K subset from the huge dataset of GDB-17 (Ramakrishnan et al., 2014; Ruddigkeit et al., 2012) and select the electronic properties HOMO, LUMO and GAP. Additionally, we use another 21K subset from the same dataset following (Ramakrishnan et al., 2015), selecting the energy properties E1-CC2, E2-CC2, f1-CC2 and f2-CC2. Furthermore, we incorporate the dataset (Wahab et al., 2022) of cata-condensed polybenzenoid hydrocarbons [...] For experimental properties, we select the BBBP and BACE datasets from MoleculeNet, ensuring that all duplicate and structurally invalid molecules were excluded. Additionally, we employ the HLM, MDR1-MDCK ER (MME), and Solubility (Solu) datasets from the Biogen ADME dataset (Fang et al., 2023). |
| Dataset Splits | Yes | In all tasks, datasets were split into training, validation, and test sets in an 8:1:1 ratio. We applied the Out-of-Distribution splitting methods, where the sets are divided based on scaffold similarity. |
| Hardware Specification | Yes | This configuration results in a model with approximately 67.8M (encoder) parameters and requires about 50 hours of training using 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions techniques like Flash Attention (Dao et al., 2022) and RoPE (Su et al., 2024), but does not provide specific version numbers for any software libraries or dependencies used in their implementation. |
| Experiment Setup | Yes | The pretraining settings are detailed in Table 5, the downstream finetuning settings in Table 6, and the downstream tasks in Table 7. [...] Table 5. Pretraining Settings Hyper-parameters Value Peak learning rate 1e-4 [...] Batch size 128 [...] Mask ratio 0.3 Cell edge length cl 0.49 Å [...] Table 6. Fine-tuning Settings Hyper-parameters Value Peak learning rate [5e-5, 1e-4] Batch size [32, 64] Epochs 200 Pooler dropout [0.0, 0.1] |
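The "grid-based space discretization" named in the Pseudocode row, combined with the cell edge length cl = 0.49 Å quoted from Table 5, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the function names `discretize` and `occupied_cells` are invented here.

```python
import math

def discretize(coords, cell_edge=0.49):
    """Map 3D atom coordinates (in Angstrom) to integer grid-cell indices.

    Sketch of grid-based space discretization: each atom is assigned to
    the cubic cell containing it, with cells of edge length `cell_edge`
    (0.49 Angstrom per the paper's Table 5).
    """
    return [tuple(math.floor(x / cell_edge) for x in xyz) for xyz in coords]

def occupied_cells(coords, cell_edge=0.49):
    """Return the set of cells containing at least one atom.

    Cells left empty between atoms are the '3D space beyond atoms' that
    the paper proposes to model alongside atomic tokens.
    """
    return set(discretize(coords, cell_edge))

# Toy example: a C-H bond length of ~1.09 Angstrom spans three cells
# along one axis at cl = 0.49 Angstrom.
atoms = [(0.0, 0.0, 0.0), (1.09, 0.0, 0.0), (0.3, 0.1, 0.0)]
cells = occupied_cells(atoms)
```

The first and third atoms fall into the same cell here, which is the situation the paper's "grid sampling/merging" component would have to handle.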
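The Dataset Splits row quotes an 8:1:1 out-of-distribution split based on scaffold similarity. A minimal sketch of such a split is shown below, assuming scaffold keys are precomputed per molecule (e.g. Bemis-Murcko scaffolds via RDKit, not shown); the function name and greedy fill strategy are illustrative assumptions, not the paper's exact procedure.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy 8:1:1 out-of-distribution split by scaffold group.

    `scaffolds` is one scaffold key per molecule. All molecules sharing
    a scaffold are kept in the same split, so test-set scaffolds never
    appear in training (the OOD property quoted in the report).
    """
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    # Place the largest scaffold groups first so the common scaffolds
    # land in the (largest) training split.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test

# Toy example: one common scaffold "a" and two rare ones.
splits = scaffold_split(["a"] * 8 + ["b"] + ["c"])
```

Greedy filling by group size is one common convention (e.g. in MoleculeNet-style splitters); the paper does not specify which variant it used.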