Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Authors: Mahdi Mostajabdaveh, Timothy Tin Long Yu, Samarendra Chandan Bindu Dash, Rindra Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, Yong Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, indicating a gap in their aptitude to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, providing insights for future research in this area. The dataset and evaluation code are publicly available.
Researcher Affiliation | Collaboration | Mahdi Mostajabdaveh (1), Timothy Tin Long Yu (1), Samarendra Chandan Bindu Dash (1,2), Rindranirina Ramamonjison (1), Jabo Serge Byusa (1), Giuseppe Carenini (3), Zirui Zhou (1), Yong Zhang (1). (1) Huawei Technologies Canada, 4321 Still Creek Dr, Burnaby, BC V5C 6S7, Canada; (2) University of Toronto, 40 George St, Toronto, ON M5S 2E4, Canada; (3) University of British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, Canada
Pseudocode | No | The paper describes methods and processes (e.g., the dataset creation process in Figure 3) and provides examples of prompt components (Figure 4) and reasoning steps (Figure 1), but it does not contain any structured pseudocode or algorithm blocks for a specific method or procedure.
Open Source Code | Yes | The dataset and evaluation code are publicly available. Code and Dataset: https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=6b98c56e-913b-47ef-8d9f-3266c8aec06a
Open Datasets | Yes | The dataset and evaluation code are publicly available. Code and Dataset: https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=6b98c56e-913b-47ef-8d9f-3266c8aec06a
Dataset Splits | Yes | ORQA comprises 1513 data instances, with 45 instances allocated as the validation set. ... Table 1: ORQA dataset statistics. Number of instances: 1513; Test/validation split: 1468/45.
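The reported split sizes are internally consistent; a quick arithmetic check (numbers taken from the Table 1 figures quoted above, variable names ours):

```python
# Sanity check on the reported ORQA split sizes (Table 1 of the paper).
total_instances = 1513
test_instances = 1468
validation_instances = 45

# Test and validation partitions should account for every instance.
assert test_instances + validation_instances == total_instances

# The validation set is a small fraction of the corpus.
print(f"validation share: {validation_instances / total_instances:.1%}")  # validation share: 3.0%
```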
Hardware Specification | No | The paper evaluates various LLMs (e.g., Llama3.1-8B-I, Falcon-7B-Instruct) and mentions one model's slow generation speed (the NuminaMath model, which took around 10 days to generate reasoning steps), but it does not provide specific details on the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper lists the LLM models evaluated (e.g., LLaMA 3.1, DeepSeek, Mixtral, FLAN-T5 XXL, Falcon-7B-Instruct, Mistral, Llama2), some with references to their initial publication (e.g., Chung et al. 2024 for FLAN-T5 XXL). However, it does not provide specific version numbers for ancillary software dependencies (such as Python, PyTorch, or CUDA) that would be needed to replicate the experimental environment.
Experiment Setup | Yes | We evaluated each model on the 1468 instances of the test set for its standard and CoT prompting capabilities in both zero-shot and few-shot settings, as described in Section 3. ... These experiments were conducted using the following settings: 0-shot with Llama-3.1-70B-Instruct, temperature set to 0.7, and each trigger prompt run five times.
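The quoted protocol (0-shot, temperature 0.7, five runs per trigger prompt) can be sketched as a small evaluation loop. This is a minimal illustration, not ORQA's actual harness: `query_llm` is a hypothetical stand-in for a real model call (e.g., to Llama-3.1-70B-Instruct), and `evaluate_trigger_prompt` is a name we introduce for illustration.

```python
# Sketch of the evaluation settings described above: each trigger prompt
# is run NUM_RUNS times at the stated sampling temperature, 0-shot
# (no in-context examples). The model call below is a mock.
import random

TEMPERATURE = 0.7  # sampling temperature reported in the paper
NUM_RUNS = 5       # each trigger prompt was run five times

def query_llm(prompt: str, temperature: float) -> str:
    """Placeholder for a real LLM call; returns a canned answer here."""
    rng = random.Random(hash((prompt, temperature)) % (2**32))
    return f"answer-{rng.randint(0, 3)}"

def evaluate_trigger_prompt(question: str, trigger: str) -> list[str]:
    """Run one trigger prompt NUM_RUNS times and collect the answers."""
    prompt = f"{question}\n{trigger}"  # 0-shot: question plus trigger only
    return [query_llm(prompt, TEMPERATURE) for _ in range(NUM_RUNS)]

answers = evaluate_trigger_prompt(
    "Maximize 3x + 2y subject to x + y <= 4, x >= 0, y >= 0.",
    "Let's think step by step.",
)
print(len(answers))  # one answer per run
```

Collecting multiple sampled answers per prompt is what makes a temperature of 0.7 meaningful here: with greedy decoding a single run would suffice.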