Energy-Based Flow Matching for Generating 3D Molecular Structure

Authors: Wenyin Zhou, Christopher Iliffe Sprague, Vsevolod Viliuga, Matteo Tadiello, Arne Elofsson, Hossein Azizpour

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on protein docking as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models using a similar computational budget.
Researcher Affiliation | Academia | 1 KTH Royal Institute of Technology, Stockholm, Sweden; 2 Science for Life Laboratory, Sweden; 3 The Alan Turing Institute, London, United Kingdom; 4 DBB at Stockholm University, Sweden; 5 Max Planck Institute for Polymer Research, Mainz, Germany. Correspondence to: Wenyin Zhou <EMAIL>.
Pseudocode | Yes | Algorithm 1: Idempotent Flow Map Training; Algorithm 2: Predictor-Refiner Sampler
Open Source Code | Yes | The source code is available at https://github.com/CaviarLover/IDFlow.
Open Datasets | Yes | To evaluate the structure generation capability of IDFlow for pocket-level docking, we train the model and its ablations on the PDBBind v2020 dataset (Liu et al., 2017) for both the time split and the 30% sequence similarity split, and on Binding MOAD (Hu et al., 2005) with the 30% sequence similarity split, as proposed in (Stärk et al., 2023). We first conduct experiments on a small curated dataset, SCOPe (Fox et al., 2014; Chandonia et al., 2022), comprising 3,928 protein structures filtered to lengths between 60 and 128 residues. Next, we evaluate IDFlow on a subset of the PDB, with maximum protein length 512 and maximum coil content of 50%, with filtering following (Yim et al., 2023b).
Dataset Splits | Yes | PDBBind v2020 (Liu et al., 2017), with a total of 19k complexes, is a commonly used benchmark for molecular docking under the time split (Stärk et al., 2022; Lu et al., 2022; Corso et al., 2023; Zhang et al., 2023; Pei et al., 2024; Corso et al., 2024). The time split proposed by (Stärk et al., 2022) consists of 17k complexes from before 2019 for training and validation and 363 complexes from 2019 onward for testing, excluding ligands seen in the training set. The 30% sequence similarity split is constructed from the same dataset but with the constraint that chain-wise similarity is less than 30%, which is considered a more difficult split than the time split. Binding MOAD (Hu et al., 2005) is another dataset curated from the PDB, with a preprocessing pipeline different from PDBBind's, yielding 41k complexes. Similar to PDBBind, the maximum 30% sequence similarity split provides 56,649, 1,136, and 1,288 examples for training, validation, and testing, respectively.
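The time split described above (pre-2019 complexes for training, 2019-onward complexes for testing, with test complexes dropped if their ligand appears in the training set) can be sketched as follows. This is an illustrative reconstruction of the split logic, not the authors' preprocessing code; the record fields `year` and `ligand` are assumed names.

```python
def time_split(complexes, cutoff_year=2019):
    """Split records of the form {'year': int, 'ligand': str, ...}
    into train (before cutoff) and test (at/after cutoff, unseen ligands)."""
    train = [c for c in complexes if c["year"] < cutoff_year]
    # ligands that occur in the training set are excluded from the test set
    seen_ligands = {c["ligand"] for c in train}
    test = [
        c for c in complexes
        if c["year"] >= cutoff_year and c["ligand"] not in seen_ligands
    ]
    return train, test
```

In the actual PDBBind time split, the pre-2019 portion is further divided into training and validation subsets; that step is omitted here for brevity.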
Hardware Specification | Yes | The training takes around 20 hours on 8 RTX A100 GPUs for single-ligand docking and one day for multi-ligand docking. We train the model on 8 RTX A100 GPUs for 150 epochs (~22 hours) on SCOPe and 600 epochs (~3 days) on PDB.
Software Dependencies | No | The implementation is based on the e3nn library (Geiger & Smidt, 2022). The model is trained for 150 epochs using the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 1e-3 and a polynomial scheduler.
Experiment Setup | Yes | The numbers of vector and scalar features for TFN are set to 32 and 8, respectively. Hence, no higher-order representations (> 1) are used in the experiments, and we do not use batch normalization or residual connections for the aggregated messages, but only layer-normalize the input features for each layer. The batch size is set to 4 per GPU. The model is trained for 150 epochs using the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 1e-3 and a polynomial scheduler. The flow matching conditional standard deviation is set to the constant σ_t = σ = 0.5. The training takes around 20 hours on 8 RTX A100 GPUs for single-ligand docking and one day for multi-ligand docking. Validation is conducted every epoch, and the checkpoint with the largest fraction of RMSD < 2 Å is selected for inference. The number of function evaluations is set to 20, consistent with HarmonicFlow. More details can be found in the GitHub repository of HarmonicFlow (https://github.com/HannesStark/FlowSite). We train the model on 8 RTX A100 GPUs for 150 epochs (~22 hours) on SCOPe and 600 epochs (~3 days) on PDB. After N_train epochs of training, checkpoints are swept every N_sweep epochs for inference, and the checkpoint achieving the highest designability is selected for the final result. N_train is set to 100 and 300 for SCOPe and PDB, respectively, and N_sweep to 10 and 50. The number of iterations K_max is set to 1, as any larger value runs out of memory even on a high-memory GPU. The other hyperparameters are kept the same as FrameFlow (https://github.com/microsoft/protein-frame-flow). We report the key settings here:
Hyperparameter | Setup
learning rate | 1e-4
node embedding dimension | 256
edge embedding dimension | 128
number of heads for IPA | 8
number of query/key points | 8
number of value channels | 12
number of heads for transformer layer | 4
number of layers | 6
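The conditional flow matching objective referenced in the setup above, with constant standard deviation σ_t = σ = 0.5, can be sketched in NumPy as follows. This is a minimal generic illustration of constant-σ conditional flow matching, not the authors' IDFlow implementation; the function name `cfm_loss` and its interface are assumptions.

```python
import numpy as np

def cfm_loss(velocity_fn, x0, x1, sigma=0.5, rng=None):
    """Conditional flow matching loss with a constant-σ Gaussian path.

    x0: noise samples, x1: data samples (same shape).
    velocity_fn(xt, t) predicts the velocity at noisy point xt, time t.
    """
    rng = np.random.default_rng() if rng is None else rng
    # sample a time t ~ U(0, 1) per example
    t = rng.random((x0.shape[0], 1))
    # conditional path: mean interpolates x0 -> x1, constant std sigma
    xt = (1 - t) * x0 + t * x1 + sigma * rng.standard_normal(x0.shape)
    # target conditional velocity for the linear interpolation path
    target = x1 - x0
    return np.mean((velocity_fn(xt, t) - target) ** 2)
```

In practice the velocity predictor would be the TFN-based network described above, trained with Adam (lr 1e-3) and a polynomial scheduler; here any callable with the same signature works.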