Language Models Need Inductive Biases to Count Inductively
Authors: Yingshan Chang, Yonatan Bisk
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work provides extensive empirical results on architectures including RNNs, Transformers, State-Space Models, and RWKV. |
| Researcher Affiliation | Academia | Yingshan Chang & Yonatan Bisk, Carnegie Mellon University |
| Pseudocode | No | The paper describes methods and mechanisms using textual explanations and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | "Code and data are released": https://github.com/zdxdsw/inductive_counting_with_LMs |
| Open Datasets | Yes | "Code and data are released": https://github.com/zdxdsw/inductive_counting_with_LMs |
| Dataset Splits | Yes | Since loss is computed at every token, rather than only at the last token, there is no need to include shorter training sequences. In fact, all training sequences have identical lengths equal to MAX_TRAIN_SEQLEN, in order to max out supervision on larger counts. Similarly, every testing sequence has a length equal to MAX_IND/OOD_SEQLEN. Please refer to the rightmost two columns of Figure 1 for their exact values (Training / IND Testing / OOD Testing). |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details such as GPU models, CPU types, or cloud computing instances used for the experiments. It only mentions model architectural parameters and training configurations. |
| Software Dependencies | No | The paper mentions following 'the standard GPT-2 implementation' but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We follow the standard GPT-2 implementation and train 1, 2, 4-layer Transformers to count. ... 28 heads, 1,024 dim and 4,096 MLP-dim. LR=1e-4 with 3k steps of linear warmup. Batch size is 32. ... The total length of training is typically 312.5K or 625K steps. |
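The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is a minimal illustration, not the authors' code: the `TrainConfig` names are invented here, the post-warmup schedule is an assumption (the paper only states the linear warmup), and only the values quoted in the table above are used.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Values quoted in the paper's experiment setup; field names are hypothetical.
    n_layers: int = 4           # 1-, 2-, and 4-layer Transformers are trained
    d_model: int = 1024         # 1,024-dim hidden size
    d_mlp: int = 4096           # 4,096-dim MLP
    peak_lr: float = 1e-4       # LR = 1e-4
    warmup_steps: int = 3_000   # 3k steps of linear warmup
    batch_size: int = 32
    total_steps: int = 312_500  # "typically 312.5K or 625K steps"


def lr_at(step: int, cfg: TrainConfig) -> float:
    """Linear warmup to peak_lr; held constant afterwards.

    The constant tail is an assumption -- the source only specifies the warmup.
    """
    if step < cfg.warmup_steps:
        return cfg.peak_lr * step / cfg.warmup_steps
    return cfg.peak_lr


cfg = TrainConfig()
print(lr_at(1_500, cfg))   # halfway through warmup: half the peak LR
print(lr_at(10_000, cfg))  # past warmup: peak LR
```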