Number Cookbook: Number Understanding of Language Models and How to Improve It

Authors: Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our work provides a more detailed and comprehensive understanding of NUPA in LLMs. Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook. ... We train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, positional encoding, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can significantly improve NUPA on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models.
Researcher Affiliation Academia 1 School of Intelligence Science and Technology, Peking University 2 Institute for Artificial Intelligence, Peking University 3 State Key Lab of General Artificial Intelligence, Peking University 4 Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China
Pseudocode Yes Below is an example of a complete Rule-Following CoT data format, where the model is required to solve the integer addition task with a right-to-left recursion and three-digit addition as the unit task.

Follow the given rule to solve the question.
Rule:

def add(num1, num2):
    result = ''
    carry = 0
    # Main Loop
    while num1 or num2:
        digit1 = int(num1[-3:]) if num1 else 0
        digit2 = int(num2[-3:]) if num2 else 0
        total = digit1 + digit2 + carry
        # zero-pad each three-digit chunk so place value is preserved
        result = str(total % 1000).zfill(3) + result
        carry = total // 1000
        num1 = num1[:-3] if num1 else num1
        num2 = num2[:-3] if num2 else num2
    if carry:
        result = str(carry) + result
    result = result.lstrip('0') or '0'
    return result
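As a sanity check, the addition rule above (with each three-digit chunk zero-padded so place value is preserved) can be verified against Python's built-in integer arithmetic. This harness is purely illustrative and is not part of the paper's released code.

```python
import random

def add(num1, num2):
    """Right-to-left recursion with three-digit addition as the unit task."""
    result = ''
    carry = 0
    while num1 or num2:
        digit1 = int(num1[-3:]) if num1 else 0
        digit2 = int(num2[-3:]) if num2 else 0
        total = digit1 + digit2 + carry
        # zero-pad each chunk so place value is preserved when prepending
        result = str(total % 1000).zfill(3) + result
        carry = total // 1000
        num1 = num1[:-3] if num1 else num1
        num2 = num2[:-3] if num2 else num2
    if carry:
        result = str(carry) + result
    return result.lstrip('0') or '0'

# Cross-check against Python's big-integer addition on random inputs.
random.seed(0)
for _ in range(1000):
    a, b = random.randint(0, 10**30), random.randint(0, 10**30)
    assert add(str(a), str(b)) == str(a + b)
```

The zero-padding (`zfill(3)`) matters: without it, a chunk sum such as 1 would be prepended as "1" instead of "001", corrupting higher-order digits (e.g. 1000 + 1 would yield 11).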
Open Source Code Yes Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook.
Open Datasets Yes Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook. ... To further facilitate reproducibility, we have included the complete dataset and the source code, enabling the generation of the entire dataset and the training and assessment of models, within the supplementary materials and the GitHub page https://github.com/GraphPKU/number_cookbook.
Dataset Splits Yes We generated 1,000 questions for each task and each length. ... For tasks that are inherently more difficult, we limit the size of the problem to 1-20 digits, and for easier tasks to 1-100 digits. ... We generate training sets (10^5 samples for each digit and each task) and validation sets for our NUPA tasks, ensuring no overlap with the original test set.
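The described per-length sampling with disjoint splits could be sketched as below. The function `sample_addition_questions` and its parameters are hypothetical illustrations, not the paper's released generator.

```python
import random

def sample_addition_questions(n_digits, n_samples, forbidden=frozenset(), seed=0):
    """Sample distinct addition questions with n_digits-digit operands,
    skipping any operand pair in `forbidden` (e.g. the held-out test set)
    so that train/validation/test splits stay disjoint."""
    rng = random.Random(seed)
    lo = 10 ** (n_digits - 1) if n_digits > 1 else 0
    hi = 10 ** n_digits - 1
    pairs = set()
    while len(pairs) < n_samples:
        pair = (rng.randint(lo, hi), rng.randint(lo, hi))
        if pair not in forbidden:
            pairs.add(pair)
    return [f"{a} + {b} = {a + b}" for a, b in sorted(pairs)]
```

For example, `sample_addition_questions(5, 1000)` would yield 1,000 distinct five-digit addition questions, matching the quoted 1,000-questions-per-length setup for the test set (and 10^5 per digit for training, with the test pairs passed as `forbidden`).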
Hardware Specification Yes Our experiments were conducted on a cluster equipped with Nvidia A800 GPUs (80GB memory).
Software Dependencies No The paper mentions the use of 'Huggingface' and 'Transformers library' for the Llama setup, and the 'AdamW optimizer', but does not specify version numbers for any of these software components.
Experiment Setup Yes We keep all hyperparameters, except model size, consistent with the original Llama setup in the implementation from Huggingface. We use the default sampling generation strategy with default hyperparameters, where the temperature is set as 0.6 and top_p is 0.9. ... We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5, weight decay of 0.01, and batch sizes of 256, 64, and 32 for 0.1B, 0.9B, and 3B models, respectively. Other optimizer settings follow the default values in the Transformers library.
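For concreteness, the quoted settings can be collected into a single configuration sketch. The key names loosely mirror Hugging Face Transformers arguments, but the exact wiring into a trainer is an assumption, not taken from the paper.

```python
# Illustrative configuration collecting the quoted hyperparameters;
# key names loosely mirror Hugging Face Transformers arguments.
FINETUNE_CONFIG = {
    "optimizer": "adamw",            # AdamW (Loshchilov & Hutter, 2019)
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    # batch size depends on model scale
    "batch_size": {"0.1B": 256, "0.9B": 64, "3B": 32},
}

GENERATION_CONFIG = {
    "do_sample": True,               # default sampling strategy
    "temperature": 0.6,
    "top_p": 0.9,
}
```

Other optimizer settings are left at the Transformers library defaults, as quoted above.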