ExLM: Rethinking the Impact of $\texttt{[MASK]}$ Tokens in Masked Language Models

Authors: Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement and effectively reduces the semantic multimodality commonly observed in MLMs.
Researcher Affiliation | Collaboration | (1) State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, China. (2) International Digital Economy Academy (IDEA), Shenzhen, China. (3) College of Computer Science, Sichuan University, Chengdu, China. (4) Paul G. Allen School of Computer Science and Engineering, University of Washington, U.S.A. Correspondence to: Ming Zhang <EMAIL>, Zhiping Xiao <EMAIL>.
Pseudocode | No | More specifically, we adopted a dynamic programming algorithm similar to that used in DA-Transformer (Huang et al., 2022). In this DP scheme, we define f_{i,u} as the cumulative probability of all partial paths ending at state u (a node in the DAG) that have generated the first i tokens of Y. Formally, u indexes the states in our DAG in a manner that respects the acyclic property (i.e., we only move forward in the state sequence), and i ranges over the positions of the target sequence Y. The paper describes the algorithm but does not present it as structured pseudocode or an algorithm block.
Open Source Code | Yes | Code is released at https://github.com/zhengkangjie/ExLM.
Open Datasets | Yes | For SMILES pre-training, we use the large-scale molecular dataset provided by Zhou et al. (2023), which includes SMILES information for 19 million molecules. We tokenize SMILES sequences with the regular expression from Schwaller et al. (2018). The pretraining hyperparameters can be found in Appendix I. [...] For fine-tuning, we employ the widely recognized MoleculeNet benchmark (Wu et al., 2018). [...] For textual pre-training, we adopt the English Wikipedia and BookCorpus datasets (Devlin, 2018) as the pre-training dataset.
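The regex-based SMILES tokenization referenced above can be sketched as follows. The pattern is a commonly reproduced one in the style of Schwaller et al. (2018); treat it as an assumption rather than the paper's exact expression:

```python
import re

# Atom-level SMILES pattern: bracket atoms, two-letter halogens (Br, Cl),
# organic-subset atoms, bonds, branches, and ring-closure digits / %NN.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: re-joining the tokens must reproduce the input exactly
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The alternation order matters: `Br?` and `Cl?` come before the single-letter atoms so that `Br` and `Cl` are kept as one token, and bracket atoms like `[nH]` are matched whole.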
Dataset Splits | Yes | For fine-tuning, we employ the widely recognized MoleculeNet benchmark (Wu et al., 2018). We follow the same data split as used by Zhou et al. (2023). [...] Table 9. Summary information of the MoleculeNet benchmark datasets:

Dataset | Tasks | Task type | Molecules (train/valid/test) | Description
BACE | 1 | Classification | 1,210/151/151 | Binding results of human BACE-1 inhibitors
BBBP | 1 | Classification | 1,631/204/204 | Blood-brain barrier penetration
ClinTox | 2 | Multi-label classification | 1,182/148/148 | Clinical trial toxicity and FDA approval status
Tox21 | 12 | Multi-label classification | 6,264/783/783 | Qualitative toxicity measurements
ToxCast | 617 | Multi-label classification | 6,860/858/858 | Toxicology data based on in vitro screening
SIDER | 27 | Multi-label classification | 1,141/143/143 | Adverse drug reactions to the 27 systemic organs
MUV | 17 | Multi-label classification | 74,469/9,309/9,309 | A subset of PubChem BioAssay
Hardware Specification | Yes | We implement the ExLM model using the Fairseq library and train ExLM on two RTX 3090 GPUs for about 24 hours. [...] Both models are trained on two Tesla A100 80G GPUs under the hyperparameters from Table 5, and their training cost statistics are shown in Table 12.
Software Dependencies | No | We implement the ExLM model using the Fairseq library and train ExLM on two RTX 3090 GPUs for about 24 hours. [...] In practice, the computation can be further optimized to O(M) by leveraging parallelized operations provided by PyTorch (Paszke, 2019), making the method highly efficient and suitable for large-scale training. The paper mentions Fairseq and PyTorch but does not provide specific version numbers.
Experiment Setup | Yes | More detailed training configurations and hyperparameters can be found in Appendix B. The results of the MNLI task, evaluated using accuracy as the primary metric (Williams et al., 2018), are presented in Figure 3. [...] Additionally, during downstream fine-tuning, the input is repeated with the same repetition count k as in pre-training. [...] The detailed training parameters are provided in Table 6. [...] For more pre-training hyperparameters, please refer to Table 7. [...] For detailed fine-tuning hyperparameters, please refer to Table 8. [...] For more details about pre-training settings, please see Appendix M. [...] We also provide the hyperparameter search space for fine-tuning in Appendix N.