Transformers Provably Solve Parity Efficiently with Chain of Thought

Authors: Juno Kim, Taiji Suzuki

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings, supported by numerical experiments, show that task decomposition and stepwise reasoning naturally arise from optimizing transformers with CoT; moreover, self-consistency checking can improve multi-step reasoning ability, aligning with empirical studies of CoT. We conduct numerical experiments supporting our findings (Section 4 and Appendix D).
Researcher Affiliation | Academia | Juno Kim (1,2), Taiji Suzuki (1,2). 1: Department of Mathematical Informatics, University of Tokyo; 2: Center for Advanced Intelligence Project, RIKEN.
Pseudocode | No | The paper describes the model architecture and processes using mathematical formulas and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not explicitly state that code is available, nor does it provide any links to a code repository or mention code in supplementary materials.
Open Datasets | No | The paper describes generating synthetic data for the k-parity problem: 'x = (x_j)_{j=1}^d ~ Unif({±1}^d), where the output y = ∏_{j∈p} x_j is determined by the parity of an unknown subset of bits p ∈ P.' It mentions using '100K 64-bit samples' but does not provide access information (link, DOI, repository, or citation) for a pre-existing public dataset.
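For concreteness, the synthetic data generation described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the choice of which bits form the secret subset are our own.

```python
import numpy as np

def sample_parity_data(n, d, parity_bits, seed=0):
    """Draw n uniform samples from {±1}^d; the label is the parity
    (product) of the coordinates indexed by parity_bits."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(n, d))
    y = np.prod(X[:, parity_bits], axis=1)
    return X, y

# Mirroring the paper's scale: 100K 64-bit samples (k = 8 shown here).
X, y = sample_parity_data(100_000, 64, parity_bits=list(range(8)))
```

Since the data is synthetic and cheap to regenerate, a seed plus this sampler fully specifies a dataset, which is why no external dataset link exists.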
Dataset Splits | No | The paper mentions training on '100K 64-bit samples' and evaluating 'prediction loss', implying a training and testing phase. However, it does not provide specific details on how these samples are split into training, validation, or test sets.
Hardware Specification | Yes | All models are optimized using full-batch gradient descent on 100K 64-bit samples with a single Tesla T4 GPU.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as programming languages, deep learning frameworks, or libraries used in the implementation.
Experiment Setup | Yes | For the transformer architecture, the feedforward layer was fixed to the following piecewise quadratic link function: ψ(t) = −4t² − 8t − 3 for t ∈ [−1, −0.5); 4t² − 1 for t ∈ [−0.5, 0.5); −4t² + 8t − 3 for t ∈ [0.5, 1]. For all CoT models, learning rates were fixed to η = 15, 50, 100 for k = 8, 16, 32, respectively. For the direct model, the learning rate was scaled to 0.01η to ensure stability of training. For the self-consistency model, filtering was done through an equivalent weight-based filter, which checks if any softmax score exceeds a threshold value, here set to 0.4. Moreover, we found that adding a 10% fraction of the gradient from the prediction loss to that of the CoT loss resulted in more stable training. Figure 4 shows training curves... over 350 epochs when k = 32.
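The piecewise quadratic link function above can be written down directly. A minimal sketch, assuming the continuous reading of the three branches (signs chosen so the pieces join smoothly at t = ±0.5; this is our interpretation, not a verbatim copy of the paper's code):

```python
import numpy as np

def link(t):
    """Piecewise quadratic link function from the experiment setup.
    Branch signs are chosen so the function is C^1 on [-1, 1]
    (our reading of the paper's definition)."""
    t = np.asarray(t, dtype=float)
    return np.where(
        t < -0.5, -4 * t**2 - 8 * t - 3,
        np.where(t < 0.5, 4 * t**2 - 1, -4 * t**2 + 8 * t - 3),
    )

# The pieces meet smoothly: link(±0.5) = 0, link(0) = -1, link(±1) = 1.
```

Under this reading the function is even, dips to −1 at the origin, and rises to 1 at the endpoints, with matching first derivatives at both breakpoints.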