Transformers Provably Solve Parity Efficiently with Chain of Thought
Authors: Juno Kim, Taiji Suzuki
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings, supported by numerical experiments, show that task decomposition and stepwise reasoning naturally arise from optimizing transformers with CoT; moreover, self-consistency checking can improve multistep reasoning ability, aligning with empirical studies of CoT. We conduct numerical experiments supporting our findings (Section 4 and Appendix D). |
| Researcher Affiliation | Academia | Juno Kim (1,2), Taiji Suzuki (1,2); (1) Department of Mathematical Informatics, University of Tokyo; (2) Center for Advanced Intelligence Project, RIKEN |
| Pseudocode | No | The paper describes the model architecture and processes using mathematical formulas and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not explicitly state that code is available, nor does it provide any links to a code repository or mention code in supplementary materials. |
| Open Datasets | No | The paper describes generating synthetic data for the k-parity problem: 'x = (x_j)_{j=1}^d ~ Unif({±1}^d), where the output y = ∏_{j∈p} x_j is determined by the parity of an unknown subset of bits p.' It mentions using '100K 64-bit samples' but does not provide access information (link, DOI, repository, or citation) for a pre-existing public dataset. |
| Dataset Splits | No | The paper mentions training on '100K 64-bit samples' and evaluating 'prediction loss', implying a training and testing phase. However, it does not provide specific details on how these samples are split into training, validation, or test sets. |
| Hardware Specification | Yes | All models are optimized using full-batch gradient descent on 100K 64-bit samples with a single Tesla T4 GPU. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as programming languages, deep learning frameworks, or libraries used in the implementation. |
| Experiment Setup | Yes | For the transformer architecture, the feedforward layer was fixed to the following piecewise quadratic link function: -4t^2 - 8t - 3 for t in [-1, -0.5), 4t^2 - 1 for t in [-0.5, 0.5), -4t^2 + 8t - 3 for t in [0.5, 1]. For all CoT models, learning rates were fixed to η = 15, 50, 100 for k = 8, 16, 32, respectively. For the direct model, the learning rate was scaled to 0.01η to ensure stability of training. For the self-consistency model, filtering was done through an equivalent weight-based filter, which checks whether any softmax score exceeds a threshold value, here set to 0.4. Moreover, we found that adding a 10% fraction of the gradient from the prediction loss to that of the CoT loss resulted in more stable training. Figure 4 shows training curves... over 350 epochs when k = 32. |
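The synthetic k-parity data described under "Open Datasets" (uniform ±1 bits, label given by the product of the bits in an unknown subset) can be sketched as follows. This is a hedged reconstruction, not the authors' code; the function name `sample_parity_data` and the choice of the first k indices as the parity set are hypothetical.

```python
import numpy as np

def sample_parity_data(n, d, parity_set, rng=None):
    """Sample n points of the k-parity task: x ~ Unif({-1, +1}^d),
    y = product of the bits indexed by parity_set (a k-subset of [d]).
    Note: a sketch of the paper's data model, not the authors' code."""
    rng = np.random.default_rng(rng)
    X = rng.choice([-1, 1], size=(n, d))
    y = X[:, parity_set].prod(axis=1)  # parity label in {-1, +1}
    return X, y

# Example matching the reported scale: 100K 64-bit samples, k = 8
# (the actual parity subset in the paper is unknown; indices 0..7 are illustrative).
X, y = sample_parity_data(100_000, 64, parity_set=list(range(8)), rng=0)
```

Note that the label is fully determined by the k relevant bits, which is what makes stepwise CoT decomposition of the product natural.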
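The piecewise quadratic link function quoted in the experiment setup can be sketched numerically. The signs and interval endpoints below are reconstructed so that the three pieces join continuously at t = ±0.5 (the extracted text had dropped minus signs); the function name `psi` is hypothetical.

```python
import numpy as np

def psi(t):
    """Piecewise quadratic link on [-1, 1] (reconstructed signs):
      -4t^2 - 8t - 3  for t in [-1, -0.5)
       4t^2 - 1       for t in [-0.5, 0.5)
      -4t^2 + 8t - 3  for t in [0.5, 1]
    Continuous, with psi(-1) = psi(1) = 1 and psi(0) = -1."""
    t = np.asarray(t, dtype=float)
    return np.where(
        t < -0.5, -4 * t**2 - 8 * t - 3,
        np.where(t < 0.5, 4 * t**2 - 1, -4 * t**2 + 8 * t - 3),
    )
```

The pieces meet at psi(±0.5) = 0, so the function is continuous across branch boundaries, which is the property that motivated reconstructing the dropped signs this way.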