A Reasoning-Based Approach to Cryptic Crossword Clue Solving
Authors: Martin Andrews, Sam Witteveen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection. (...) 4. Experiments |
| Researcher Affiliation | Industry | Red Dragon AI, Singapore. Correspondence to: Martin Andrews <EMAIL>. |
| Pseudocode | Yes | Figure 1. Proving process: answer candidate → wordplay → LLM formalisation (...) Figure 4. Python proving: answer candidate → wordplay → LLM formalisation |
| Open Source Code | Yes | To promote further study in this area, all code for training the models, the formaliser and domain-specific verifier is made publicly available. (...) https://github.com/mdda/cryptic-crossword-reasoning-verifier (...) Python code for the complete end-to-end system is available under an Apache 2 license at https://github.com/mdda/cryptic-crossword-reasoning-verifier. |
| Open Datasets | Yes | The benchmark dataset used by this work is Cryptonite (Efrat et al., 2021), a large-scale dataset of Cryptic Crossword clues from The Times and The Telegraph (major UK newspapers). (...) The Wordplay dataset (Andrews, 2024), an example from which is given in Figure 3, consists of data gathered from websites where cryptic crossword enthusiasts post solutions on a daily basis for each of the major publications. (...) Datasets Resources such as dictionaries used, and the Cryptonite and Wordplay datasets, are available online, via the sources referenced in the main text. |
| Dataset Splits | Yes | The dataset contains 523,000 naturally sourced clues (published between 2001 and 2020), with the train, validation and testing splits being chosen so that a given answer can only appear in one of the splits. (...) The Wordplay dataset deliberately follows the train, validation, and test splits defined by Cryptonite. (...) As in Saha et al. (2024), due to computational constraints, we performed sampling of the validation and test sets, using fewer than the full 26k examples available. The standard deviation of these figures is 1.5% at 1000 samples, and 3.3% at 200. |
| Hardware Specification | No | Support for this research was provided by the Google AI Developer Programs team, including access to the Gemini models and GPUs on Google Cloud Platform. (...) The Fine-Tuning of the Gemma2 9B model took around 24 hours for a full Cryptonite training run, and 8 hours for the Wordplay dataset runs. Thus, the single-GPU model runs totalled less than $50 USD. |
| Software Dependencies | Yes | To formalise wordplay into Python proofs of the correctness of solutions, we used Google's Gemini-Flash-1.5-001 LLM (a pinned model version) during development. (...) we fine-tuned a Gemma2 9B base model (Gemma Team & Google DeepMind, 2024) using the LoRA (Hu et al., 2022) implementation provided by the unsloth package (unsloth.ai, 2024). |
| Experiment Setup | Yes | The model was trained for 1 epoch on the Cryptonite training set of approximately 470,000 examples. (...) The model was trained for 4 epochs on a set of approximately 16,800 examples (...) For each clue being evaluated, we generate 20 valid answer candidates (...) generate 10 guesses at wordplay (...) This cycle is repeated until a formalisation is validated (zero assertion failures, considered a SUCCESS with the answer having been proved), or max_rewrites=2 is reached. (...) candidate generation with t = 1.0 |
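The experiment-setup excerpt describes a generate-and-prove loop: sample answer candidates, sample wordplay for each, formalise the wordplay into Python, and accept an answer once its formalisation runs with zero assertion failures. A minimal sketch of that control flow is below; the function names (`generate_candidates`, `generate_wordplay`, `formalise`) are hypothetical stand-ins for the paper's LLM calls, not the authors' actual API, and real proof checking would do more than catch `AssertionError`.

```python
# Hedged sketch of the candidate-proving loop described in the paper's
# Experiment Setup. Names and signatures here are illustrative only.

MAX_REWRITES = 2  # the paper's max_rewrites=2 cutoff


def run_proof(python_code: str) -> bool:
    """A formalisation is validated when the generated Python executes
    with zero assertion failures (simulated here with exec)."""
    try:
        exec(python_code, {})
        return True
    except AssertionError:
        return False


def solve_clue(clue, generate_candidates, generate_wordplay, formalise):
    # 20 valid answer candidates per clue (sampled at temperature t=1.0)
    for answer in generate_candidates(clue, n=20):
        # 10 wordplay guesses per candidate
        for wordplay in generate_wordplay(clue, answer, n=10):
            code = formalise(clue, answer, wordplay)
            for _ in range(MAX_REWRITES + 1):
                if run_proof(code):
                    return answer  # SUCCESS: answer proved
                # on failure, ask the LLM to rewrite the formalisation
                code = formalise(clue, answer, wordplay, previous=code)
    return None  # no candidate could be proved
```

The key design point the excerpt implies is that verification is cheap and local (running Python assertions), so many noisy LLM samples can be filtered down to a single proven answer.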