A Reasoning-Based Approach to Cryptic Crossword Clue Solving
Authors: Martin Andrews, Sam Witteveen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection. (...) 4. Experiments |
| Researcher Affiliation | Industry | Red Dragon AI, Singapore. Correspondence to: Martin Andrews <EMAIL>. |
| Pseudocode | Yes | Figure 1. Proving process: answer candidate → wordplay → LLM formalisation (...) Figure 4. Python proving: answer candidate → wordplay → LLM formalisation |
| Open Source Code | Yes | To promote further study in this area, all code for training the models, the formaliser and domain-specific verifier is made publicly available. (...) https://github.com/mdda/cryptic-crossword-reasoning-verifier (...) Python code for the complete end-to-end system is available under an Apache 2 license at https://github.com/mdda/cryptic-crossword-reasoning-verifier. |
| Open Datasets | Yes | The benchmark dataset used by this work is Cryptonite (Efrat et al., 2021), a large-scale dataset of Cryptic Crossword clues from The Times and The Telegraph (major UK newspapers). (...) The Wordplay dataset (Andrews, 2024), an example from which is given in Figure 3, consists of data gathered from websites where cryptic crossword enthusiasts post solutions on a daily basis for each of the major publications. (...) Datasets Resources such as dictionaries used, and the Cryptonite and Wordplay datasets, are available online, via the sources referenced in the main text. |
| Dataset Splits | Yes | The dataset contains 523,000 naturally sourced clues (published between 2001 and 2020), with the train, validation and testing splits being chosen so that a given answer can only appear in one of the splits. (...) The Wordplay dataset deliberately follows the train, validation, and test splits defined by Cryptonite. (...) As in Saha et al. (2024), due to computational constraints, we performed sampling of the validation and test sets, using fewer than the full 26k examples available. The standard deviation of these figures is 1.5% at 1000 samples, and 3.3% at 200. |
| Hardware Specification | No | Support for this research was provided by the Google AI Developer Programs team, including access to the Gemini models and GPUs on Google Cloud Platform. (...) The Fine-Tuning of the Gemma2 9B model took around 24 hours for a full Cryptonite training run, and 8 hours for the Wordplay dataset runs. Thus, the single-GPU model runs totalled less than $50 USD. |
| Software Dependencies | Yes | To formalise wordplay into Python proofs of the correctness of solutions, we used Google's Gemini-Flash-1.5-001 LLM (a pinned model version) during development. (...) we fine-tuned a Gemma2 9B base model (Gemma Team & Google DeepMind, 2024) using the LoRA (Hu et al., 2022) implementation provided by the unsloth package (unsloth.ai, 2024). |
| Experiment Setup | Yes | The model was trained for 1 epoch on the Cryptonite training set of approximately 470,000 examples. (...) The model was trained for 4 epochs on a set of approximately 16,800 examples (...) For each clue being evaluated, we generate 20 valid answer candidates (...) generate 10 guesses at wordplay (...) This cycle is repeated until a formalisation is validated (zero assertion failures, considered a SUCCESS with the answer having been proved), or max_rewrites=2 is reached. (...) candidate generation with t = 1.0 |
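The experiment-setup excerpt describes a generate-and-prove loop: sample answer candidates, sample wordplay for each, formalise the wordplay into Python, and accept an answer once its formalisation runs with zero assertion failures. A minimal sketch of that control flow is below; the function names (`generate_candidates`, `generate_wordplay`, `formalise`) are hypothetical stand-ins for the paper's LLM calls, not the authors' actual API, and real proof checking would do more than catch `AssertionError`.

```python
# Hedged sketch of the candidate-proving loop described in the paper's
# Experiment Setup. Names and signatures here are illustrative only.

MAX_REWRITES = 2  # the paper's max_rewrites=2 cutoff


def run_proof(python_code: str) -> bool:
    """A formalisation is validated when the generated Python executes
    with zero assertion failures (simulated here with exec)."""
    try:
        exec(python_code, {})
        return True
    except AssertionError:
        return False


def solve_clue(clue, generate_candidates, generate_wordplay, formalise):
    # 20 valid answer candidates per clue (sampled at temperature t=1.0)
    for answer in generate_candidates(clue, n=20):
        # 10 wordplay guesses per candidate
        for wordplay in generate_wordplay(clue, answer, n=10):
            code = formalise(clue, answer, wordplay)
            for _ in range(MAX_REWRITES + 1):
                if run_proof(code):
                    return answer  # SUCCESS: answer proved
                # on failure, ask the LLM to rewrite the formalisation
                code = formalise(clue, answer, wordplay, previous=code)
    return None  # no candidate could be proved
```

The key design point the excerpt implies is that verification is cheap and local (running Python assertions), so many noisy LLM samples can be filtered down to a single proven answer.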