Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Authors: Mahdi Mostajabdaveh, Timothy Tin Long Yu, Samarendra Chandan Bindu Dash, Rindra Ramamonjison, Jabo Serge Byusa, Giuseppe Carenini, Zirui Zhou, Yong Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, indicating a gap in their aptitude to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, providing insights for future research in this area. The dataset and evaluation code are publicly available.
Researcher Affiliation | Collaboration | Mahdi Mostajabdaveh (1), Timothy Tin Long Yu (1), Samarendra Chandan Bindu Dash (1,2), Rindranirina Ramamonjison (1), Jabo Serge Byusa (1), Giuseppe Carenini (3), Zirui Zhou (1), Yong Zhang (1). (1) Huawei Technologies Canada, 4321 Still Creek Dr, Burnaby, BC V5C 6S7, Canada; (2) University of Toronto, 40 George St, Toronto, ON M5S 2E4, Canada; (3) University of British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, Canada
Pseudocode | No | The paper describes methods and processes (e.g., the dataset creation process in Figure 3) and provides examples of prompt components (Figure 4) and reasoning steps (Figure 1), but it does not contain any structured pseudocode or algorithm blocks for a specific method or procedure.
Open Source Code | Yes | The dataset and evaluation code are publicly available. Code and Dataset: https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=6b98c56e-913b-47ef-8d9f-3266c8aec06a
Open Datasets | Yes | The dataset and evaluation code are publicly available. Code and Dataset: https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=6b98c56e-913b-47ef-8d9f-3266c8aec06a
Dataset Splits | Yes | ORQA comprises 1513 data instances, with 45 instances allocated as the validation set. ... Table 1: ORQA dataset statistics. Number of instances: 1513; Test/validation split: 1468/45.
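The reported split sizes are internally consistent; a quick arithmetic check (numbers taken from the Table 1 figures quoted above, variable names ours):

```python
# Sanity check on the reported ORQA split sizes (Table 1 of the paper).
total_instances = 1513
test_instances = 1468
validation_instances = 45

# Test and validation partitions should account for every instance.
assert test_instances + validation_instances == total_instances

# The validation set is a small fraction of the corpus.
print(f"validation share: {validation_instances / total_instances:.1%}")  # validation share: 3.0%
```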
Hardware Specification | No | The paper evaluates various LLMs (e.g., Llama3.1-8B-I, Falcon-7B-Instruct) and mentions one model's slow generation speed (the NuminaMath model, which took around 10 days to generate reasoning steps), but it does not provide specific details on the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper lists the LLM models evaluated (e.g., LLaMA 3.1, DeepSeek, Mixtral, FLAN-T5 XXL, Falcon-7B-Instruct, Mistral, Llama2), some with references to their initial publication (e.g., Chung et al. 2024 for FLAN-T5 XXL). However, it does not provide specific version numbers for ancillary software dependencies (such as Python, PyTorch, or CUDA) that would be needed to replicate the experimental environment.
Experiment Setup | Yes | We evaluated each model on the 1468 instances of the test set for its standard and CoT prompting capabilities in both zero-shot and few-shot settings, as described in Section 3. ... These experiments were conducted using the following settings: 0-shot with Llama-3.1-70B-Instruct, temperature set to 0.7, and each trigger prompt run five times.
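The quoted protocol (0-shot, temperature 0.7, five runs per trigger prompt) can be sketched as a small evaluation loop. This is a minimal illustration, not ORQA's actual harness: `query_llm` is a hypothetical stand-in for a real model call (e.g., to Llama-3.1-70B-Instruct), and `evaluate_trigger_prompt` is a name we introduce for illustration.

```python
# Sketch of the evaluation settings described above: each trigger prompt
# is run NUM_RUNS times at the stated sampling temperature, 0-shot
# (no in-context examples). The model call below is a mock.
import random

TEMPERATURE = 0.7  # sampling temperature reported in the paper
NUM_RUNS = 5       # each trigger prompt was run five times

def query_llm(prompt: str, temperature: float) -> str:
    """Placeholder for a real LLM call; returns a canned answer here."""
    rng = random.Random(hash((prompt, temperature)) % (2**32))
    return f"answer-{rng.randint(0, 3)}"

def evaluate_trigger_prompt(question: str, trigger: str) -> list[str]:
    """Run one trigger prompt NUM_RUNS times and collect the answers."""
    prompt = f"{question}\n{trigger}"  # 0-shot: question plus trigger only
    return [query_llm(prompt, TEMPERATURE) for _ in range(NUM_RUNS)]

answers = evaluate_trigger_prompt(
    "Maximize 3x + 2y subject to x + y <= 4, x >= 0, y >= 0.",
    "Let's think step by step.",
)
print(len(answers))  # one answer per run
```

Collecting multiple sampled answers per prompt is what makes a temperature of 0.7 meaningful here: with greedy decoding a single run would suffice.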