PENCIL: Long Thoughts with Short Memory
Authors: Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, for example, we demonstrate PENCIL with a small 25M-parameter transformer and 2048 context length solves Einstein's puzzle, a task that challenges much larger models like GPT-4. ... We train and evaluate PENCIL on SAT, QBF, and Einstein's puzzle tasks that inherently require exponential computation time. ... Table 1: Performance on SAT (left) and QBF (right). Acc denotes the Accuracy (%) and TR denotes the trace rate (%). |
| Researcher Affiliation | Academia | Toyota Technological Institute at Chicago. Correspondence to: Chenxiao Yang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Implementation of addition operator, add(ψ₁, ψ₂). Input: two seq-to-embedding functions ψ₁, ψ₂ ∈ H(ℝᵈ). Output: a seq-to-embedding function ψ ∈ H(ℝᵈ). ψ_cat ← [ψ₁, ψ₂] // concatenate the two functions; ψ ← (ψ_cat)₁ + (ψ_cat)₂ // linear transformation: summation over both coordinates; return ψ |
| Open Source Code | Yes | See discussions about related work in Appendix A. Codes are available at https://github.com/chr26195/PENCIL. |
| Open Datasets | No | We train and evaluate PENCIL on SAT, QBF, and Einstein's puzzle tasks that inherently require exponential computation time. ... For each size of the puzzle, we generate 10,000 training instances by randomly assigning attributes to houses and deriving valid constraints that ensure a unique solution. |
| Dataset Splits | Yes | Evaluation Protocol: We evaluate on a held-out validation set of 100 problem instances using two metrics: accuracy (percentage of correct predictions) and trace rate (percentage of reasoning steps matching the ground truth). For all problems, the labels for different classes are balanced. ... For each size of the puzzle, we generate 10,000 training instances by randomly assigning attributes to houses and deriving valid constraints that ensure a unique solution. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory) are provided in the paper. The paper only describes model parameters and context window length: "...a small transformer with 25M-parameter and 2048-token context." and "...6-layer transformer with 10.63M parameters for SAT and QBF problems, and an 8-layer transformer with 25.19M parameters for the more complex Einstein's puzzle. All experiments use a context window of 2048 tokens..." |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. The text refers to architectural components like "6-layer transformer" and "rotary positional encoding (Su et al., 2024)", but not to specific software libraries or their versions. |
| Experiment Setup | No | The paper states: "We use the same batch size and learning rate for all methods across experiments." However, it does not specify the actual values for these or any other hyperparameters (e.g., number of epochs, optimizer details), which are critical for reproducing the experimental setup. |
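The pseudocode row quotes the paper's Algorithm 1, which builds a new seq-to-embedding function by concatenating the outputs of two such functions and summing the two halves. A minimal sketch of that operator, assuming (hypothetically) that each ψ is a Python callable mapping a token sequence to a d-dimensional NumPy vector:

```python
import numpy as np

def add(psi1, psi2):
    """Sketch of the quoted Algorithm 1: combine two seq-to-embedding
    functions psi1, psi2 : sequence -> R^d into a single function
    whose output is their elementwise sum."""
    def psi(seq):
        # psi_cat <- [psi1, psi2]: concatenate the two embeddings
        # into one 2d-dimensional vector.
        cat = np.concatenate([psi1(seq), psi2(seq)])
        d = cat.shape[0] // 2
        # Linear transformation: sum over both d-dimensional coordinates.
        return cat[:d] + cat[d:]
    return psi

# Usage with two toy seq-to-embedding functions on token-id lists
# (these example functions are illustrative, not from the paper).
psi_sum = add(lambda s: np.array([float(sum(s)), 1.0]),
              lambda s: np.array([float(len(s)), 2.0]))
psi_sum([1, 2, 3])  # -> array([9., 3.])
```

The intermediate concatenation is redundant numerically (one could return `psi1(seq) + psi2(seq)` directly), but it mirrors the algorithm's two explicit steps: concatenation followed by a linear summation over both coordinates.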