Reasoning over Uncertain Text by Generative Large Language Models
Authors: Aliakbar Nafar, Kristen Brent Venable, Parisa Kordjamshidi
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use BLInD to find out the limitations of LLMs for tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by providing an evaluation of our methods on BLInD and an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs. |
| Researcher Affiliation | Academia | Aliakbar Nafar (1), Kristen Brent Venable (2, 3), Parisa Kordjamshidi (1). (1) Michigan State University; (2) Florida Institute for Human and Machine Cognition; (3) University of West Florida |
| Pseudocode | No | The paper describes how LLMs generate code for Python or ProbLog, and refers to 'probabilistic inference algorithms' and 'Monte Carlo Inference Algorithm', but does not present a structured pseudocode or algorithm block for its own methodology. Figure 2 shows examples of LLM-generated code as outputs of the method. |
| Open Source Code | Yes | Code and Dataset https://github.com/HLR/BLInD |
| Open Datasets | Yes | Code and Dataset https://github.com/HLR/BLInD. We introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. ...an adaptation of a causal reasoning question-answering dataset, CLADDER (Jin et al. 2023). |
| Dataset Splits | Yes | To assess our methods, we randomly select 100 instances from each data split V_i, resulting in a total of 900 instances. |
| Hardware Specification | No | The paper mentions employing LLM models such as Llama3, GPT3.5, and GPT4, but it does not provide specific hardware details (like GPU/CPU models or types) used to run the experiments with these models. |
| Software Dependencies | No | The paper mentions using the Python library pgmpy (Ankan and Panda 2015) and ProbLog (De Raedt, Kimmig, and Toivonen 2007) but does not provide specific version numbers for these software components. |
| Experiment Setup | No | Refer to the Appendix of the arXiv version of the paper for additional information, including our models' hyperparameters (the link is provided below the abstract). |
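
To make concrete what "mapping the problem to Python code" can look like, here is a minimal sketch of exact inference by enumeration over a tiny Bayesian network. It is an illustration only, not the paper's prompts or generated code: the network (`rain`, `sprinkler`, `wet`), every probability value, and the helper names `joint` and `query` are hypothetical, and plain Python is used in place of pgmpy or ProbLog so the example stays self-contained.

```python
from itertools import product

# Hypothetical network: rain -> wet <- sprinkler (values are made up).
P_RAIN = 0.2
P_SPRINKLER = 0.1
# P(wet = True | rain, sprinkler)
P_WET = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

NAMES = ("rain", "sprinkler", "wet")


def joint(rain, sprinkler, wet):
    """Joint probability of one full assignment, via the chain rule."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    p *= p_wet if wet else 1 - p_wet
    return p


def query(target, evidence):
    """Return P(target | evidence) by summing the joint over all worlds."""
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        # Skip worlds that contradict the evidence.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if all(world[name] == value for name, value in target.items()):
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(query({"rain": True}, {"wet": True}))
```

Enumeration is exponential in the number of variables, so it only suits very small networks; it is shown here because every step is visible, whereas a library such as pgmpy would hide the summation behind an inference call.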