Language Models Need Inductive Biases to Count Inductively
Authors: Yingshan Chang, Yonatan Bisk
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work provides extensive empirical results on architectures including RNNs, Transformers, State-Space Models, and RWKV. |
| Researcher Affiliation | Academia | Yingshan Chang & Yonatan Bisk, Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and mechanisms using textual explanations and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | "Code and data are released": https://github.com/zdxdsw/inductive_counting_with_LMs |
| Open Datasets | Yes | "Code and data are released": https://github.com/zdxdsw/inductive_counting_with_LMs |
| Dataset Splits | Yes | Since loss is computed at every token, rather than only at the last token, there is no need to include shorter training sequences. In fact, all training sequences have identical lengths equal to MAX_TRAIN_SEQLEN, in order to max out supervision on larger counts. Similarly, every testing sequence has a length equal to MAX_IND/OOD_SEQLEN. Please refer to the rightmost two columns of Figure 1 for their exact values (Training / IND Testing / OOD Testing). |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details such as GPU models, CPU types, or cloud computing instances used for the experiments. It only mentions model architectural parameters and training configurations. |
| Software Dependencies | No | The paper mentions following 'the standard GPT-2 implementation' but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We follow the standard GPT-2 implementation and train 1, 2, 4-layer Transformers to count. ... 28 heads, 1,024 dim and 4,096 MLP-dim. LR=1e-4 with 3k steps of linear warmup. Batch size is 32. ... The total length of training is typically 312.5K or 625K steps. |
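The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is a minimal illustration, not the authors' code: the `TrainConfig` names are invented here, the post-warmup schedule is an assumption (the paper only states the linear warmup), and only the values quoted in the table above are used.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Values quoted in the paper's experiment setup; field names are hypothetical.
    n_layers: int = 4           # 1-, 2-, and 4-layer Transformers are trained
    d_model: int = 1024         # 1,024-dim hidden size
    d_mlp: int = 4096           # 4,096-dim MLP
    peak_lr: float = 1e-4       # LR = 1e-4
    warmup_steps: int = 3_000   # 3k steps of linear warmup
    batch_size: int = 32
    total_steps: int = 312_500  # "typically 312.5K or 625K steps"


def lr_at(step: int, cfg: TrainConfig) -> float:
    """Linear warmup to peak_lr; held constant afterwards.

    The constant tail is an assumption -- the source only specifies the warmup.
    """
    if step < cfg.warmup_steps:
        return cfg.peak_lr * step / cfg.warmup_steps
    return cfg.peak_lr


cfg = TrainConfig()
print(lr_at(1_500, cfg))   # halfway through warmup: half the peak LR
print(lr_at(10_000, cfg))  # past warmup: peak LR
```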