PENCIL: Long Thoughts with Short Memory
Authors: Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, for example, we demonstrate PENCIL with a small 25M-parameter transformer and 2048 context length solves Einstein's puzzle, a task that challenges much larger models like GPT-4. ... We train and evaluate PENCIL on SAT, QBF, and Einstein's puzzle tasks that inherently require exponential computation time. ... Table 1: Performance on SAT (left) and QBF (right). Acc denotes the Accuracy (%) and TR denotes the trace rate (%). |
| Researcher Affiliation | Academia | Toyota Technological Institute at Chicago. Correspondence to: Chenxiao Yang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Implementation of addition operator, add(ψ₁, ψ₂). Input: two seq-to-embedding functions ψ₁, ψ₂ ∈ H(ℝᵈ). Output: a seq-to-embedding function ψ ∈ H(ℝᵈ). ψ_cat ← [ψ₁, ψ₂] // concatenate the two functions; ψ ← (ψ_cat)₁ + (ψ_cat)₂ // linear transformation: summation over both coordinates; return ψ |
| Open Source Code | Yes | See discussions about related work in Appendix A. Codes are available at https://github.com/chr26195/PENCIL. |
| Open Datasets | No | We train and evaluate PENCIL on SAT, QBF, and Einstein's puzzle tasks that inherently require exponential computation time. ... For each size of the puzzle, we generate 10,000 training instances by randomly assigning attributes to houses and deriving valid constraints that ensure a unique solution. |
| Dataset Splits | Yes | Evaluation Protocol: We evaluate on a held-out validation set of 100 problem instances using two metrics: accuracy (percentage of correct predictions) and trace rate (percentage of reasoning steps matching the ground truth). For all problems, the labels for different classes are balanced. ... For each size of the puzzle, we generate 10,000 training instances by randomly assigning attributes to houses and deriving valid constraints that ensure a unique solution. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory) are provided in the paper. The paper only describes model parameters and context window length: "...a small transformer with 25M-parameter and 2048-token context." and "...6-layer transformer with 10.63M parameters for SAT and QBF problems, and an 8-layer transformer with 25.19M parameters for the more complex Einstein's puzzle. All experiments use a context window of 2048 tokens..." |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. The text refers to architectural components like "6-layer transformer" and "rotary positional encoding (Su et al., 2024)", but not to specific software libraries or their versions. |
| Experiment Setup | No | The paper states: "We use the same batch size and learning rate for all methods across experiments." However, it does not specify the actual values for these or any other hyperparameters (e.g., number of epochs, optimizer details), which are critical for reproducing the experimental setup. |
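The pseudocode row quotes the paper's Algorithm 1, which builds a new seq-to-embedding function by concatenating the outputs of two such functions and summing the two halves. A minimal sketch of that operator, assuming (hypothetically) that each ψ is a Python callable mapping a token sequence to a d-dimensional NumPy vector:

```python
import numpy as np

def add(psi1, psi2):
    """Sketch of the quoted Algorithm 1: combine two seq-to-embedding
    functions psi1, psi2 : sequence -> R^d into a single function
    whose output is their elementwise sum."""
    def psi(seq):
        # psi_cat <- [psi1, psi2]: concatenate the two embeddings
        # into one 2d-dimensional vector.
        cat = np.concatenate([psi1(seq), psi2(seq)])
        d = cat.shape[0] // 2
        # Linear transformation: sum over both d-dimensional coordinates.
        return cat[:d] + cat[d:]
    return psi

# Usage with two toy seq-to-embedding functions on token-id lists
# (these example functions are illustrative, not from the paper).
psi_sum = add(lambda s: np.array([float(sum(s)), 1.0]),
              lambda s: np.array([float(len(s)), 2.0]))
psi_sum([1, 2, 3])  # -> array([9., 3.])
```

The intermediate concatenation is redundant numerically (one could return `psi1(seq) + psi2(seq)` directly), but it mirrors the algorithm's two explicit steps: concatenation followed by a linear summation over both coordinates.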