Number Cookbook: Number Understanding of Language Models and How to Improve It
Authors: Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work provides a more detailed and comprehensive understanding of NUPA in LLMs. Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook. ... We train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, positional encoding, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can significantly improve NUPA on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. |
| Researcher Affiliation | Academia | 1 School of Intelligence Science and Technology, Peking University 2 Institution for Artificial Intelligence, Peking University 3 State Key Lab of General Artificial Intelligence, Peking University 4 Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China EMAIL EMAIL EMAIL |
| Pseudocode | Yes | Below is an example of a complete Rule-Following CoT data format, where the model is required to solve an integer addition task with a right-to-left recursion and three-digit addition as the unit task. Follow the given rule to solve the question. Rule: def add(num1, num2): result = '' carry = 0 # Main Loop while num1 or num2: digit1 = int(num1[-3:]) if num1 else 0 digit2 = int(num2[-3:]) if num2 else 0 total = digit1 + digit2 + carry result = str(total%1000) + result carry = total//1000 num1 = num1[:-3] if num1 else num1 num2 = num2[:-3] if num2 else num2 if carry: result = str(carry) + result result = result.lstrip('0') or '0' return result |
| Open Source Code | Yes | Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook. |
| Open Datasets | Yes | Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook. ... To further facilitate reproducibility, we have included the complete dataset and the source code, enabling the generation of the entire dataset and the training and assessment of models, within the supplementary materials and the github page https://github.com/GraphPKU/number_cookbook. |
| Dataset Splits | Yes | We generated 1,000 questions for each task and each length. ... For tasks that are inherently more difficult, we limit the size of the problem to 1-20 digits, and for easier tasks to 1-100 digits. ... We generate training sets (10^5 samples for each digit and each task) and validation sets for our NUPA tasks, ensuring no overlap with the original test set. |
| Hardware Specification | Yes | Our experiments were conducted on a cluster equipped with Nvidia A800 GPUs (80GB memory). |
| Software Dependencies | No | The paper mentions the use of 'Huggingface' and 'Transformers library' for Llama setup, and the 'AdamW optimizer', but does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | We keep all hyperparameters, except model size, consistent with the original Llama setup in the implementation from Huggingface. We use the default sampling generation strategy with default hyperparameters, where the temperature is set as 0.6 and top_p is 0.9. ... We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5, weight decay of 0.01, and batch sizes of 256, 64, and 32 for 0.1B, 0.9B, and 3B models, respectively. Other optimizer settings follow the default values in the Transformers library. |
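The Rule-Following CoT rule quoted in the Pseudocode row is flattened into a single table cell, and as written it omits zero-padding of intermediate three-digit chunks (e.g. a chunk sum of 2 would be appended as "2" rather than "002", corrupting multi-chunk results). A minimal runnable sketch, with that padding fix added via `.zfill(3)`:

```python
def add(num1: str, num2: str) -> str:
    """Right-to-left string addition using three-digit chunks as the unit task."""
    result = ''
    carry = 0
    # Main loop: peel three digits off the right of each operand per step.
    while num1 or num2:
        digit1 = int(num1[-3:]) if num1 else 0
        digit2 = int(num2[-3:]) if num2 else 0
        total = digit1 + digit2 + carry
        # Zero-pad each chunk to three digits so place value is preserved.
        result = str(total % 1000).zfill(3) + result
        carry = total // 1000
        num1 = num1[:-3]
        num2 = num2[:-3]
    if carry:
        result = str(carry) + result
    # Strip leading zeros introduced by padding; keep a lone '0' for zero.
    result = result.lstrip('0') or '0'
    return result
```

For example, `add("1001", "1")` returns `"1002"`; without the padding fix, the quoted rule would produce `"12"` on the same input.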