Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems
Authors: Junyi Ye, Jingyi Gu, Xinyun Zhao, Wenpeng Yin, Guiling Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery. |
| Researcher Affiliation | Academia | 1New Jersey Institute of Technology, Newark, USA 2The Pennsylvania State University, State College, PA, USA |
| Pseudocode | No | The paper describes its methodology in textual form and refers to a figure (Figure 3) for illustration, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/NJIT-AI-Center/CreativeMath |
| Open Datasets | No | The paper introduces the CREATIVEMATH dataset and states it was sourced from Art of Problem Solving (AoPS), with a link to their wiki. However, it does not provide a direct link, DOI, or specific repository for the curated CREATIVEMATH dataset itself; the GitHub link provided is explicitly labeled as 'Code'. |
| Dataset Splits | Yes | We selected a subset from our CREATIVEMATH dataset for this study. For each competition, 50 samples were randomly chosen to ensure a representative evaluation of the LLMs' performance. The datasets were meticulously curated to ensure that when the problem and all reference solutions were included in the novel solution generation prompt, the total token count did not exceed 3K tokens. In total, the dataset comprises 400 math problems and 605 solutions, forming 605 distinct samples with k varying from 1 to 5. |
| Hardware Specification | Yes | Open-source LLMs were run using the Hugging Face library on one to four NVIDIA A100 (80G) GPUs, depending on the model's memory requirements. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face library' but does not specify any version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | To ensure reproducibility, all experiments were conducted using the greedy decoding strategy, adhering to the recommended settings provided on the official Hugging Face pages or in the models' respective papers. The system prompt followed the guidelines outlined in each model's documentation, with the maximum number of new tokens set to 1024. |
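The experiment-setup row above can be sketched in code. This is a minimal, hypothetical illustration of how the stated settings (greedy decoding, 1024 max new tokens) map onto the Hugging Face `transformers` `generate` API; the `generate_solution` helper and its names are assumptions for illustration, not the authors' actual code.

```python
# Decoding configuration as described in the paper's setup, expressed as
# keyword arguments for the Hugging Face `transformers` `generate` API.
GENERATION_KWARGS = {
    "do_sample": False,      # greedy decoding: deterministic, reproducible outputs
    "num_beams": 1,          # plain greedy search, no beam search
    "max_new_tokens": 1024,  # cap on newly generated tokens stated in the paper
}


def generate_solution(model, tokenizer, prompt: str) -> str:
    """Run one greedy generation pass (hypothetical helper).

    `model` and `tokenizer` are assumed to be a loaded Hugging Face
    causal LM and its tokenizer.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    # Strip the prompt tokens so only the newly generated text remains.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because sampling is disabled, repeated runs on the same hardware and model weights should yield identical outputs, which is what makes this configuration reproducible.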