Number Cookbook: Number Understanding of Language Models and How to Improve It

Authors: Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our work provides a more detailed and comprehensive understanding of NUPA in LLMs. Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook. ... We train small models with existing and potential techniques for enhancing NUPA (such as tokenizers, positional encoding, and number formats), comprehensively evaluating their effectiveness using our testbed. We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can significantly improve NUPA on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models.
Researcher Affiliation Academia 1 School of Intelligence Science and Technology, Peking University 2 Institute for Artificial Intelligence, Peking University 3 State Key Lab of General Artificial Intelligence, Peking University 4 Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China
Pseudocode Yes Below is an example of a complete Rule-Following CoT data format, where the model is required to solve the integer addition task with a right-to-left recursion and three-digit addition as the unit task.

Follow the given rule to solve the question.
Rule:

def add(num1, num2):
    result = ''
    carry = 0
    # Main Loop
    while num1 or num2:
        digit1 = int(num1[-3:]) if num1 else 0
        digit2 = int(num2[-3:]) if num2 else 0
        total = digit1 + digit2 + carry
        # zero-pad each three-digit chunk so place value is preserved
        result = str(total % 1000).zfill(3) + result
        carry = total // 1000
        num1 = num1[:-3] if num1 else num1
        num2 = num2[:-3] if num2 else num2
    if carry:
        result = str(carry) + result
    result = result.lstrip('0') or '0'
    return result
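As a sanity check, the addition rule above (with each three-digit chunk zero-padded so place value is preserved) can be verified against Python's built-in integer arithmetic. This harness is purely illustrative and is not part of the paper's released code.

```python
import random

def add(num1, num2):
    """Right-to-left recursion with three-digit addition as the unit task."""
    result = ''
    carry = 0
    while num1 or num2:
        digit1 = int(num1[-3:]) if num1 else 0
        digit2 = int(num2[-3:]) if num2 else 0
        total = digit1 + digit2 + carry
        # zero-pad each chunk so place value is preserved when prepending
        result = str(total % 1000).zfill(3) + result
        carry = total // 1000
        num1 = num1[:-3] if num1 else num1
        num2 = num2[:-3] if num2 else num2
    if carry:
        result = str(carry) + result
    return result.lstrip('0') or '0'

# Cross-check against Python's big-integer addition on random inputs.
random.seed(0)
for _ in range(1000):
    a, b = random.randint(0, 10**30), random.randint(0, 10**30)
    assert add(str(a), str(b)) == str(a + b)
```

The zero-padding (`zfill(3)`) matters: without it, a chunk sum such as 1 would be prepended as "1" instead of "001", corrupting higher-order digits (e.g. 1000 + 1 would yield 11).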
Open Source Code Yes Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook.
Open Datasets Yes Our benchmark and codes are released at https://github.com/GraphPKU/number_cookbook. ... To further facilitate reproducibility, we have included the complete dataset and the source code, enabling the generation of the entire dataset and the training and assessment of models, within the supplementary materials and the GitHub page https://github.com/GraphPKU/number_cookbook.
Dataset Splits Yes We generated 1,000 questions for each task and each length. ... For tasks that are inherently more difficult, we limit the size of the problem to 1-20 digits, and for easier tasks to 1-100 digits. ... We generate training sets (10^5 samples for each digit and each task) and validation sets for our NUPA tasks, ensuring no overlap with the original test set.
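The described per-length sampling with disjoint splits could be sketched as below. The function `sample_addition_questions` and its parameters are hypothetical illustrations, not the paper's released generator.

```python
import random

def sample_addition_questions(n_digits, n_samples, forbidden=frozenset(), seed=0):
    """Sample distinct addition questions with n_digits-digit operands,
    skipping any operand pair in `forbidden` (e.g. the held-out test set)
    so that train/validation/test splits stay disjoint."""
    rng = random.Random(seed)
    lo = 10 ** (n_digits - 1) if n_digits > 1 else 0
    hi = 10 ** n_digits - 1
    pairs = set()
    while len(pairs) < n_samples:
        pair = (rng.randint(lo, hi), rng.randint(lo, hi))
        if pair not in forbidden:
            pairs.add(pair)
    return [f"{a} + {b} = {a + b}" for a, b in sorted(pairs)]
```

For example, `sample_addition_questions(5, 1000)` would yield 1,000 distinct five-digit addition questions, matching the quoted 1,000-questions-per-length setup for the test set (and 10^5 per digit for training, with the test pairs passed as `forbidden`).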
Hardware Specification Yes Our experiments were conducted on a cluster equipped with Nvidia A800 GPUs (80GB memory).
Software Dependencies No The paper mentions the use of 'Huggingface' and 'Transformers library' for the Llama setup, and the 'AdamW optimizer', but does not specify version numbers for any of these software components.
Experiment Setup Yes We keep all hyperparameters, except model size, consistent with the original Llama setup in the implementation from Huggingface. We use the default sampling generation strategy with default hyperparameters, where the temperature is set as 0.6 and top_p is 0.9. ... We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5, weight decay of 0.01, and batch sizes of 256, 64, and 32 for 0.1B, 0.9B, and 3B models, respectively. Other optimizer settings follow the default values in the Transformers library.
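For concreteness, the quoted settings can be collected into a single configuration sketch. The key names loosely mirror Hugging Face Transformers arguments, but the exact wiring into a trainer is an assumption, not taken from the paper.

```python
# Illustrative configuration collecting the quoted hyperparameters;
# key names loosely mirror Hugging Face Transformers arguments.
FINETUNE_CONFIG = {
    "optimizer": "adamw",            # AdamW (Loshchilov & Hutter, 2019)
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    # batch size depends on model scale
    "batch_size": {"0.1B": 256, "0.9B": 64, "3B": 32},
}

GENERATION_CONFIG = {
    "do_sample": True,               # default sampling strategy
    "temperature": 0.6,
    "top_p": 0.9,
}
```

Other optimizer settings are left at the Transformers library defaults, as quoted above.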