ExLM: Rethinking the Impact of $\texttt{[MASK]}$ Tokens in Masked Language Models

Authors: Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement and effectively reduces the semantic multimodality commonly observed in MLMs.
Researcher Affiliation | Collaboration | (1) State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, China. (2) International Digital Economy Academy (IDEA), Shenzhen, China. (3) College of Computer Science, Sichuan University, Chengdu, China. (4) Paul G. Allen School of Computer Science and Engineering, University of Washington, U.S.A. Correspondence to: Ming Zhang <EMAIL>, Zhiping Xiao <EMAIL>.
Pseudocode | No | More specifically, we adopted a dynamic programming algorithm similar to that used in DA-Transformer (Huang et al., 2022). In this DP scheme, we define f_{i,u} as the cumulative probability of all partial paths ending at state u (a node in the DAG) that have generated the first i tokens of Y. Formally, u indexes the states in our DAG in a manner that respects the acyclic property (i.e., we only move forward in the state sequence), and i ranges over the positions of the target sequence Y. The paper describes the algorithm but does not present it as structured pseudocode or an algorithm block.
Open Source Code | Yes | Code is released at https://github.com/zhengkangjie/ExLM.
Open Datasets | Yes | For SMILES pre-training, we use the large-scale molecular dataset provided by Zhou et al. (2023), which includes SMILES information for 19 million molecules. We tokenize SMILES sequences with the regular expression from Schwaller et al. (2018). The pretraining hyperparameters can be found in Appendix I. [...] For fine-tuning, we employ the widely recognized MoleculeNet benchmark (Wu et al., 2018). [...] For textual pre-training, we adopt the English Wikipedia and BookCorpus datasets (Devlin, 2018) as the pre-training dataset.
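The regex-based SMILES tokenization referenced above can be sketched as follows. The pattern is a commonly reproduced one in the style of Schwaller et al. (2018); treat it as an assumption rather than the paper's exact expression:

```python
import re

# Atom-level SMILES pattern: bracket atoms, two-letter halogens (Br, Cl),
# organic-subset atoms, bonds, branches, and ring-closure digits / %NN.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: re-joining the tokens must reproduce the input exactly
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The alternation order matters: `Br?` and `Cl?` come before the single-letter atoms so that `Br` and `Cl` are kept as one token, and bracket atoms like `[nH]` are matched whole.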
Dataset Splits | Yes | For fine-tuning, we employ the widely recognized MoleculeNet benchmark (Wu et al., 2018). We follow the same data split as used by Zhou et al. (2023). [...] Table 9. Summary information of the MoleculeNet benchmark datasets:

Dataset | Tasks | Task type | Molecules (train/valid/test) | Description
BACE | 1 | Classification | 1,210/151/151 | Binding results of human BACE-1 inhibitors
BBBP | 1 | Classification | 1,631/204/204 | Blood-brain barrier penetration
ClinTox | 2 | Multi-label classification | 1,182/148/148 | Clinical trial toxicity and FDA approval status
Tox21 | 12 | Multi-label classification | 6,264/783/783 | Qualitative toxicity measurements
ToxCast | 617 | Multi-label classification | 6,860/858/858 | Toxicology data based on in vitro screening
SIDER | 27 | Multi-label classification | 1,141/143/143 | Adverse drug reactions to the 27 systemic organs
MUV | 17 | Multi-label classification | 74,469/9,309/9,309 | A subset of PubChem BioAssay
Hardware Specification | Yes | We implement the ExLM model using the Fairseq library and train ExLM on two RTX 3090 GPUs for about 24 hours. [...] Both models are trained on two Tesla A100 80G GPUs under the hyperparameters from Table 5, and their training cost statistics are shown in Table 12.
Software Dependencies | No | We implement the ExLM model using the Fairseq library and train ExLM on two RTX 3090 GPUs for about 24 hours. [...] In practice, the computation can be further optimized to O(M) by leveraging parallelized operations provided by PyTorch (Paszke, 2019), making the method highly efficient and suitable for large-scale training. The paper mentions Fairseq and PyTorch but does not provide specific version numbers.
Experiment Setup | Yes | More detailed training configurations and hyperparameters can be found in Appendix B. The results of the MNLI task, evaluated using accuracy as the primary metric (Williams et al., 2018), are presented in Figure 3. [...] Additionally, during downstream fine-tuning, the input is repeated with the same repetition count k as in pre-training. [...] The detailed training parameters are provided in Table 6. [...] For more pre-training hyperparameters, please refer to Table 7. [...] For detailed fine-tuning hyperparameters, please refer to Table 8. [...] For more details about pre-training settings, please see Appendix M. [...] We also provide the hyperparameter search space for fine-tuning in Appendix N.