Reasoning over Uncertain Text by Generative Large Language Models

Authors: Aliakbar Nafar, Kristen Brent Venable, Parisa Kordjamshidi

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use BLInD to find out the limitations of LLMs for tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by providing an evaluation of our methods on BLInD and an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs.
Researcher Affiliation | Academia | Aliakbar Nafar (1), Kristen Brent Venable (2,3), Parisa Kordjamshidi (1); (1) Michigan State University, (2) Florida Institute for Human and Machine Cognition, (3) University of West Florida
Pseudocode | No | The paper describes how LLMs generate code for Python or ProbLog, and refers to 'probabilistic inference algorithms' and a 'Monte Carlo Inference Algorithm', but does not present a structured pseudocode or algorithm block for its own methodology. Figure 2 shows examples of LLM-generated code as outputs of the method.
Open Source Code | Yes | Code and Dataset: https://github.com/HLR/BLInD
Open Datasets | Yes | Code and Dataset: https://github.com/HLR/BLInD. We introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. ...an adaptation of a causal reasoning question-answering dataset, CLADDER (Jin et al. 2023).
Dataset Splits | Yes | To assess our methods, we randomly select 100 instances from each data split Vi, resulting in a total of 900 instances.
Hardware Specification | No | The paper mentions employing LLM models such as Llama3, GPT3.5, and GPT4, but it does not provide specific hardware details (like GPU/CPU models or types) used to run the experiments with these models.
Software Dependencies | No | The paper mentions using the Python library pgmpy (Ankan and Panda 2015) and ProbLog (De Raedt, Kimmig, and Toivonen 2007) but does not provide specific version numbers for these software components.
Experiment Setup | No | Refer to the Appendix of the arXiv version of the paper for additional information, including our models' hyperparameters (the link is provided below the abstract).
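The table above notes that the paper's prompting strategies map probabilistic questions to Python code and to a "Monte Carlo Inference Algorithm" but gives no pseudocode. A minimal sketch of that kind of mapping is shown below. This is not the authors' implementation: the three-node chain A -> B -> C and all probabilities are hypothetical, chosen only to illustrate how forward sampling with rejection estimates a conditional query of the sort posed in BLInD.

```python
import random

# Hypothetical Bayesian network A -> B -> C (probabilities are illustrative,
# not taken from BLInD). Query: P(C = true | A = true).
P_A = 0.3                                # P(A = true)
P_B_GIVEN_A = {True: 0.8, False: 0.1}    # P(B = true | A)
P_C_GIVEN_B = {True: 0.9, False: 0.2}    # P(C = true | B)

def sample_once(rng):
    """Forward-sample one joint assignment (a, b, c) from the network."""
    a = rng.random() < P_A
    b = rng.random() < P_B_GIVEN_A[a]
    c = rng.random() < P_C_GIVEN_B[b]
    return a, b, c

def estimate_p_c_given_a(n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(C | A) via rejection sampling:
    keep only samples where A is true, then count how often C is true."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n_samples):
        a, _, c = sample_once(rng)
        if a:
            kept += 1
            hits += c
    return hits / kept

# Exact answer by enumeration over B, for comparison:
# P(C|A) = P(B|A) * P(C|B) + P(not B|A) * P(C|not B)
exact = 0.8 * 0.9 + 0.2 * 0.2  # = 0.76
print(estimate_p_c_given_a(), exact)
```

With 100,000 samples the rejection-sampling estimate lands close to the exact 0.76; the same forward-sample-and-filter pattern extends to any network an LLM could emit as Python code from a BLInD-style problem description.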