M^2LLM: Multi-view Molecular Representation Learning with Large Language Models

Authors: Jiaxin Ju, Yizhen Zheng, Huan Yee Koh, Can Wang, Shirui Pan

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4.1 Experimental Setup (Dataset): Our framework is evaluated on 8 datasets spanning 34 tasks from MoleculeNet [Wu et al., 2018], including physiology-related tasks like BBBP [Martins et al., 2012], ClinTox [Gayvert et al., 2016], and 27 SIDER tasks [Kuhn et al., 2016] for adverse drug reaction prediction. Additionally, we evaluate classification tasks from BACE [Subramanian et al., 2016] and HIV [Wu et al., 2018], as well as regression tasks from ESOL [Delaney, 2004], FreeSolv [Mobley and Guthrie, 2014], and Lipophilicity [Wu et al., 2018]. ... 4.2 Performance on Classification Tasks: We evaluate M^2LLM on five classification datasets with 31 subtasks, as shown in Table 1. We report the mean and standard deviation over 10 random seeds using receiver operating characteristic area under the curve (ROC-AUC, %) as the evaluation metric, where higher scores indicate better performance. ... 4.3 Performance on Regression Tasks: We evaluate M^2LLM on three regression tasks, as shown in Table 2. We report the RMSE for regression, where lower values signify better results.
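The two evaluation metrics named in the quote (ROC-AUC for classification, RMSE for regression) can be sketched in plain Python. This is an illustrative re-implementation for reference only; the paper's actual evaluation code is not available, and in practice one would use a library routine such as sklearn's roc_auc_score.

```python
import math

def roc_auc(labels, scores):
    """Binary ROC-AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 ground-truth labels.
    scores: iterable of predicted scores (higher = more positive).
    """
    # Assign 1-based ranks to scores, averaging ranks within tie groups.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    # U statistic of the positives, normalized to [0, 1].
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def rmse(y_true, y_pred):
    """Root mean squared error, the regression metric reported in Table 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` yields 0.75, matching the standard library implementations on this classic test case.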
Researcher Affiliation | Academia | Jiaxin Ju1, Yizhen Zheng2, Huan Yee Koh2,3, Can Wang1 and Shirui Pan1; 1School of Information and Communication Technology, Griffith University; 2Department of Data Science and AI, Monash University; 3Drug Discovery Biology, Monash Institute of Pharmaceutical Sciences, Monash University. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the M^2LLM framework components and their interactions using narrative text and diagrams (Figure 1 and Figure 2), but does not present any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | Our framework is evaluated on 8 datasets spanning 34 tasks from MoleculeNet [Wu et al., 2018], including physiology-related tasks like BBBP [Martins et al., 2012], ClinTox [Gayvert et al., 2016], and 27 SIDER tasks [Kuhn et al., 2016] for adverse drug reaction prediction. Additionally, we evaluate classification tasks from BACE [Subramanian et al., 2016] and HIV [Wu et al., 2018], as well as regression tasks from ESOL [Delaney, 2004], FreeSolv [Mobley and Guthrie, 2014], and Lipophilicity [Wu et al., 2018].
Dataset Splits | Yes | We use the scaffold splitting method recommended by MoleculeNet [Wu et al., 2018], which assigns molecules with distinct structural scaffolds to separate training, validation, and test sets. This method, unlike random splitting, ensures structural dissimilarity between sets, creating a more challenging evaluation scenario.
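The scaffold split quoted above can be sketched as follows. In practice the grouping key would be a Bemis-Murcko scaffold SMILES (e.g. from RDKit's MurckoScaffold module); to keep this sketch self-contained we take the scaffold keys as precomputed input. The largest-group-first greedy assignment is an assumption; MoleculeNet/DeepChem implementations differ in detail.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Deterministic scaffold split.

    scaffolds: dict mapping molecule index -> scaffold key (in real use,
    a Bemis-Murcko scaffold SMILES string).
    Whole scaffold groups are assigned to one split, so no scaffold is
    shared between train, validation, and test.
    """
    groups = defaultdict(list)
    for idx, scaf in scaffolds.items():
        groups[scaf].append(idx)
    # Place the largest scaffold groups first, for a stable assignment.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g))
    n = len(scaffolds)
    n_train, n_valid = int(frac_train * n), int(frac_valid * n)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(valid) + len(group) <= n_valid:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because groups are never broken up, every molecule in the test set has a scaffold unseen during training, which is what makes this split harder than a random one.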
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running its experiments. It mentions using various LLM models (Galactica, LLaMa-3.1, OpenAI's models) but not the underlying hardware for the M^2LLM framework's experiments.
Software Dependencies | No | The paper mentions using specific LLM models such as 'Galactica models (6.7B and 30B parameters) [Taylor et al., 2022]', 'LLaMa-3.1 models (8B and 8B-instruct) [Dubey et al., 2024]', and 'OpenAI's closed-source text embedding models (small and large configurations) [OpenAI, 2024]'. However, it does not provide specific version numbers for these or other ancillary software libraries or programming languages required for reproduction.
Experiment Setup | No | The paper describes the overall framework, including the multi-layer perceptron (MLP) for prediction and the loss functions used (cross-entropy for classification, RMSE for regression). It also states that 'The framework is trained to optimize the weights α_i^struct, α_i^task, and α_i^rule, and the MLP parameters using task-specific loss functions.' However, it does not provide concrete hyperparameter values for the MLP (e.g., number of layers, hidden dimensions, learning rate, batch size, number of epochs) or other system-level training settings for the M^2LLM framework.
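The view-weighting step quoted in this row could look like the following minimal sketch. Since the paper gives no code or hyperparameters, everything here is an assumption: the softmax normalization of the learned weights, the function names, and the binary cross-entropy form of the classification loss.

```python
import math

def fuse_views(e_struct, e_task, e_rule, alphas):
    """Weighted fusion of the three embedding views (structure-, task-,
    and rule-based) into one molecule representation.

    alphas: three learnable scalars (the paper's alpha_struct, alpha_task,
    alpha_rule); normalizing them with a softmax is our assumption.
    """
    exps = [math.exp(a) for a in alphas]
    total = sum(exps)
    w = [e / total for e in exps]
    return [w[0] * s + w[1] * t + w[2] * r
            for s, t, r in zip(e_struct, e_task, e_rule)]

def bce_loss(p, y):
    """Binary cross-entropy, the classification loss named in the paper.

    p: predicted probability in (0, 1); y: ground-truth label in {0, 1}.
    """
    eps = 1e-12  # guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
```

With all three alphas equal, the fused vector is the plain average of the views; training would then move the alphas to favor whichever view is most informative for the task.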