Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems
Authors: Junyi Ye, Jingyi Gu, Xinyun Zhao, Wenpeng Yin, Guiling Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery. |
| Researcher Affiliation | Academia | 1New Jersey Institute of Technology, Newark, USA 2The Pennsylvania State University, State College, PA, USA |
| Pseudocode | No | The paper describes its methodology in textual form and refers to a figure (Figure 3) for illustration, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/NJIT-AI-Center/CreativeMath |
| Open Datasets | No | The paper introduces the CREATIVEMATH dataset and states it was sourced from Art of Problem Solving (AoPS), with a link to their wiki. However, it does not provide a direct link, DOI, or specific repository for the curated CREATIVEMATH dataset itself; the GitHub link provided is explicitly labeled as 'Code'. |
| Dataset Splits | Yes | We selected a subset from our CREATIVEMATH dataset for this study. For each competition, 50 samples were randomly chosen to ensure a representative evaluation of the LLMs' performance. The datasets were meticulously curated to ensure that when the problem and all reference solutions were included in the novel solution generation prompt, the total token count did not exceed 3K tokens. In total, the dataset comprises 400 math problems and 605 solutions, forming 605 distinct samples with k varying from 1 to 5. |
| Hardware Specification | Yes | Open-source LLMs were run using the Hugging Face library on one to four NVIDIA A100 (80G) GPUs, depending on the model's memory requirements. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face library' but does not specify any version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | To ensure reproducibility, all experiments were conducted using the greedy decoding strategy, adhering to the recommended settings provided on the official Hugging Face pages or in the models' respective papers. The system prompt followed the guidelines outlined in each model's documentation, with the maximum number of new tokens set to 1024. |
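The experiment-setup row above can be sketched in code. This is a minimal, hypothetical illustration of how the stated settings (greedy decoding, 1024 max new tokens) map onto the Hugging Face `transformers` `generate` API; the `generate_solution` helper and its names are assumptions for illustration, not the authors' actual code.

```python
# Decoding configuration as described in the paper's setup, expressed as
# keyword arguments for the Hugging Face `transformers` `generate` API.
GENERATION_KWARGS = {
    "do_sample": False,      # greedy decoding: deterministic, reproducible outputs
    "num_beams": 1,          # plain greedy search, no beam search
    "max_new_tokens": 1024,  # cap on newly generated tokens stated in the paper
}


def generate_solution(model, tokenizer, prompt: str) -> str:
    """Run one greedy generation pass (hypothetical helper).

    `model` and `tokenizer` are assumed to be a loaded Hugging Face
    causal LM and its tokenizer.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    # Strip the prompt tokens so only the newly generated text remains.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because sampling is disabled, repeated runs on the same hardware and model weights should yield identical outputs, which is what makes this configuration reproducible.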