Transformers Provably Solve Parity Efficiently with Chain of Thought

Authors: Juno Kim, Taiji Suzuki

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings, supported by numerical experiments, show that task decomposition and stepwise reasoning naturally arise from optimizing transformers with CoT; moreover, self-consistency checking can improve multi-step reasoning ability, aligning with empirical studies of CoT. We conduct numerical experiments supporting our findings (Section 4 and Appendix D).
Researcher Affiliation | Academia | Juno Kim (1,2), Taiji Suzuki (1,2). 1: Department of Mathematical Informatics, University of Tokyo; 2: Center for Advanced Intelligence Project, RIKEN.
Pseudocode | No | The paper describes the model architecture and processes using mathematical formulas and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not explicitly state that code is available, nor does it provide any links to a code repository or mention code in supplementary materials.
Open Datasets | No | The paper describes generating synthetic data for the k-parity problem: 'x = (x_j)_{j=1}^d ~ Unif({±1}^d), where the output y = ∏_{j∈p} x_j is determined by the parity of an unknown subset of bits p ∈ P.' It mentions using '100K 64-bit samples' but does not provide access information (link, DOI, repository, or citation) for a pre-existing public dataset.
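For concreteness, the synthetic data generation described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the choice of which bits form the secret subset are our own.

```python
import numpy as np

def sample_parity_data(n, d, parity_bits, seed=0):
    """Draw n uniform samples from {±1}^d; the label is the parity
    (product) of the coordinates indexed by parity_bits."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(n, d))
    y = np.prod(X[:, parity_bits], axis=1)
    return X, y

# Mirroring the paper's scale: 100K 64-bit samples (k = 8 shown here).
X, y = sample_parity_data(100_000, 64, parity_bits=list(range(8)))
```

Since the data is synthetic and cheap to regenerate, a seed plus this sampler fully specifies a dataset, which is why no external dataset link exists.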
Dataset Splits | No | The paper mentions training on '100K 64-bit samples' and evaluating 'prediction loss', implying a training and testing phase. However, it does not provide specific details on how these samples are split into training, validation, or test sets.
Hardware Specification | Yes | All models are optimized using full-batch gradient descent on 100K 64-bit samples with a single Tesla T4 GPU.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as programming languages, deep learning frameworks, or libraries used in the implementation.
Experiment Setup | Yes | For the transformer architecture, the feedforward layer was fixed to the following piecewise quadratic link function: ψ(t) = −4t² − 8t − 3 for t ∈ [−1, −0.5); 4t² − 1 for t ∈ [−0.5, 0.5); −4t² + 8t − 3 for t ∈ [0.5, 1]. For all CoT models, learning rates were fixed to η = 15, 50, 100 for k = 8, 16, 32, respectively. For the direct model, the learning rate was scaled to 0.01η to ensure stability of training. For the self-consistency model, filtering was done through an equivalent weight-based filter, which checks if any softmax score exceeds a threshold value, here set to 0.4. Moreover, we found that adding a 10% fraction of the gradient from the prediction loss to that of the CoT loss resulted in more stable training. Figure 4 shows training curves... over 350 epochs when k = 32.
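The piecewise quadratic link function above can be written down directly. A minimal sketch, assuming the continuous reading of the three branches (signs chosen so the pieces join smoothly at t = ±0.5; this is our interpretation, not a verbatim copy of the paper's code):

```python
import numpy as np

def link(t):
    """Piecewise quadratic link function from the experiment setup.
    Branch signs are chosen so the function is C^1 on [-1, 1]
    (our reading of the paper's definition)."""
    t = np.asarray(t, dtype=float)
    return np.where(
        t < -0.5, -4 * t**2 - 8 * t - 3,
        np.where(t < 0.5, 4 * t**2 - 1, -4 * t**2 + 8 * t - 3),
    )

# The pieces meet smoothly: link(±0.5) = 0, link(0) = -1, link(±1) = 1.
```

Under this reading the function is even, dips to −1 at the origin, and rises to 1 at the endpoints, with matching first derivatives at both breakpoints.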