Language Models Need Inductive Biases to Count Inductively

Authors: Yingshan Chang, Yonatan Bisk

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work provides extensive empirical results on architectures ranging from RNNs and Transformers to State-Space Models and RWKV.
Researcher Affiliation | Academia | Yingshan Chang and Yonatan Bisk, Carnegie Mellon University.
Pseudocode | No | The paper describes its methods and mechanisms using textual explanations and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code | Yes | "Code and data are released": https://github.com/zdxdsw/inductive_counting_with_LMs
Open Datasets | Yes | "Code and data are released": https://github.com/zdxdsw/inductive_counting_with_LMs
Dataset Splits | Yes | "Since loss is computed at every token, rather than only at the last token, there is no need to include shorter training sequences. In fact, all training sequences have identical lengths equal to MAX_TRAIN_SEQLEN, in order to max out supervision on larger counts. Similarly, every testing sequence has a length equal to MAX_IND/OOD_SEQLEN." The exact values for the Training, IND Testing, and OOD Testing splits appear in the rightmost two columns of Figure 1.
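The length-based split described above can be illustrated with a minimal sketch. The length constants and the two-symbol vocabulary here are placeholders, not the paper's actual values (those are listed in its Figure 1); the sketch only shows why fixed-length sequences suffice when a running count supervises every position.

```python
import random

# Placeholder lengths; the paper's exact values are given in its Figure 1.
MAX_TRAIN_SEQLEN = 64   # all training sequences share this single length
MAX_OOD_SEQLEN = 128    # OOD test sequences are strictly longer

def make_counting_sequence(length, vocab=("a", "b"), target="a"):
    """Build a toy counting example: each position is labeled with the
    running count of the target token, so loss can be computed at every
    token rather than only at the last one."""
    tokens = [random.choice(vocab) for _ in range(length)]
    counts, c = [], 0
    for t in tokens:
        c += (t == target)
        counts.append(c)
    return tokens, counts

train = [make_counting_sequence(MAX_TRAIN_SEQLEN) for _ in range(4)]
ood = [make_counting_sequence(MAX_OOD_SEQLEN) for _ in range(2)]
```

Because every prefix of a long sequence yields a supervised count, shorter training sequences would add no supervision that the fixed-length ones do not already contain.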
Hardware Specification | No | The paper does not explicitly provide specific hardware details such as GPU models, CPU types, or cloud computing instances used for the experiments. It only mentions model architectural parameters and training configurations.
Software Dependencies | No | The paper mentions following "the standard GPT-2 implementation" but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | "We follow the standard GPT-2 implementation and train 1, 2, 4-layer Transformers to count. ... 8 heads, 1,024 dim and 4,096 MLP-dim. LR=1e-4 with 3k steps of linear warmup. Batch size is 32. ... The total length of training is typically 312.5K or 625K steps."
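The quoted hyperparameters can be collected into a plain config sketch. Only the values quoted above come from the paper; the key names and the derived per-head dimension are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the reported training configuration. Key names are
# assumptions; numeric values are taken from the quoted setup.
config = {
    "architecture": "GPT-2-style Transformer",
    "n_layers_variants": (1, 2, 4),   # layer counts compared in the paper
    "n_heads": 8,
    "d_model": 1024,
    "d_mlp": 4096,                    # 4x d_model, the standard GPT-2 ratio
    "learning_rate": 1e-4,
    "warmup_steps": 3_000,            # linear warmup
    "batch_size": 32,
    "total_steps_variants": (312_500, 625_000),  # "312.5K or 625K steps"
}

# Derived quantity (assumption): per-head dimension under standard
# multi-head attention, d_model / n_heads.
d_head = config["d_model"] // config["n_heads"]  # 128
```

Note that the consistency checks here (MLP dim as 4x the model dim, an integer per-head dimension) are what make 8 heads at 1,024 dim a standard GPT-2-style configuration.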