A Compact Model for Mathematics Problem Representations Distilled from BERT
Authors: Hao Ming, Xinguo Yu, Xiaotian Cheng, Zhenquan Shen, Xiaopan Lyu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings indicate that our approach can reduce the size of a BERT model by 10% while retaining approximately 95% of its performance on MWP datasets, outperforming the mainstream BERT-based task-agnostic compact models. The efficacy of each component has been validated through ablation studies. Our experiments mainly use four commonly used Chinese MWP datasets. |
| Researcher Affiliation | Academia | Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan, China. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology with figures and text but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an unambiguous statement about releasing code for the described methodology, nor does it provide a direct link to a code repository. It mentions 'Our model is implemented by PyTorch', but this is not a code release. |
| Open Datasets | Yes | Our experiments mainly use four commonly used Chinese MWP datasets. Math23k is the most widely used dataset which contains 23162 math application problems with annotated equations and answers. Ape210k is an enormous math dataset including 210488 MWPs. Since Ape210k has many noisy examples that miss annotations or cannot be solved, we use the re-organized datasets called Ape-clean (Liang et al. 2022) and full Ape210k can still be used for MLM pretraining. HMWP consists of 5470 MWPs including multi-unknown problems and non-linear problems, making problem solving more challenging. CM17K is another large-scale MWP dataset, which contains 6215 arithmetic problems, 5193 one-unknown linear problems, 3129 one-unknown nonlinear problems, and 2498 equation set problems. |
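As a quick sanity check on the CM17K composition quoted above (this snippet is illustrative, not from the paper), the four problem-type subsets should sum to roughly 17K, consistent with the dataset's name:

```python
# Subset sizes of CM17K as reported in the quoted passage.
cm17k_subsets = {
    "arithmetic": 6215,
    "one_unknown_linear": 5193,
    "one_unknown_nonlinear": 3129,
    "equation_set": 2498,
}

# Total problem count across all four subsets.
total = sum(cm17k_subsets.values())
print(total)  # 17035, i.e. ~17K
```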
| Dataset Splits | No | The paper mentions using Math23k and Ape-clean datasets for evaluation but does not specify the training, validation, or test splits used for these datasets (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | Yes | Our model is implemented by PyTorch on an NVIDIA A800 100 GB GPU. |
| Software Dependencies | No | The paper states 'Our model is implemented by PyTorch' but does not specify a version number for PyTorch or any other software libraries used. |
| Experiment Setup | Yes | At the pretraining stage, 150 epochs are trained using the Adam optimizer with an initial learning rate of 1e-5 and weight decay of 1e-5; the mini-batch size is set to 128. At the task-specific distillation stage, we also use the Adam optimizer, the initial learning rate is set as 3e-5, and we pre-train for 120 epochs. The loss weights α1, α2, and α3 of our distillation loss, obtained by grid search, are set as 1.0, 0.9, and 1.0. According to our extensive experiments, the hyperparameters θ and λ are set as 0.2 and 0.5, respectively. The temperature factor for soft labels is set to 4. We fine-tune the student model using a batch size of 32 for 100 epochs, and the dropout rate is 0.1. The initial fine-tuning learning rate is set as 1e-5 and 1e-4 for the student model and GTS, respectively. |
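The hyperparameters quoted in the Experiment Setup row can be collected into per-stage configuration dicts. This is a hedged sketch for readability only, not the authors' code; the stage names and key names are our own labels:

```python
# Hyperparameters as quoted from the paper, grouped by training stage.

# Stage 1: MLM pretraining of the compact student model.
pretraining = {
    "optimizer": "Adam",
    "epochs": 150,
    "learning_rate": 1e-5,
    "weight_decay": 1e-5,
    "batch_size": 128,
}

# Stage 2: task-specific distillation from the BERT teacher.
distillation = {
    "optimizer": "Adam",
    "epochs": 120,
    "learning_rate": 3e-5,
    # Loss weights alpha1-alpha3 found by grid search.
    "loss_weights": {"alpha1": 1.0, "alpha2": 0.9, "alpha3": 1.0},
    "theta": 0.2,
    "lambda": 0.5,
    "soft_label_temperature": 4,
}

# Stage 3: fine-tuning the student (with the GTS decoder).
fine_tuning = {
    "epochs": 100,
    "batch_size": 32,
    "dropout": 0.1,
    "learning_rate_student": 1e-5,
    "learning_rate_gts": 1e-4,
}
```

Grouping the values this way makes it easy to spot what the paper does and does not pin down: all optimizer settings are given, but, as noted in the Software Dependencies row, no library versions are reported.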