Reasoning over Uncertain Text by Generative Large Language Models

Authors: Aliakbar Nafar, Kristen Brent Venable, Parisa Kordjamshidi

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use BLInD to find out the limitations of LLMs for tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by providing an evaluation of our methods on BLInD and an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs.
Researcher Affiliation | Academia | Aliakbar Nafar (1), Kristen Brent Venable (2,3), Parisa Kordjamshidi (1); (1) Michigan State University, (2) Florida Institute for Human and Machine Cognition, (3) University of West Florida
Pseudocode | No | The paper describes how LLMs generate code for Python or ProbLog, and refers to 'probabilistic inference algorithms' and a 'Monte Carlo Inference Algorithm', but does not present a structured pseudocode or algorithm block for its own methodology. Figure 2 shows examples of LLM-generated code as outputs of the method.
Open Source Code | Yes | Code and Dataset: https://github.com/HLR/BLInD
Open Datasets | Yes | Code and Dataset: https://github.com/HLR/BLInD. We introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. ...an adaptation of a causal reasoning question-answering dataset, CLADDER (Jin et al. 2023).
Dataset Splits | Yes | To assess our methods, we randomly select 100 instances from each data split Vi, resulting in a total of 900 instances.
Hardware Specification | No | The paper mentions employing LLM models such as Llama3, GPT3.5, and GPT4, but it does not provide specific hardware details (like GPU/CPU models or types) used to run the experiments with these models.
Software Dependencies | No | The paper mentions using the Python library pgmpy (Ankan and Panda 2015) and ProbLog (De Raedt, Kimmig, and Toivonen 2007) but does not provide specific version numbers for these software components.
Experiment Setup | No | Refer to the Appendix of the arXiv version of the paper for additional information, including our models' hyperparameters (the link is provided below the abstract).
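The table above notes that the paper's prompting strategies map probabilistic questions to Python code and to a "Monte Carlo Inference Algorithm" but gives no pseudocode. A minimal sketch of that kind of mapping is shown below. This is not the authors' implementation: the three-node chain A -> B -> C and all probabilities are hypothetical, chosen only to illustrate how forward sampling with rejection estimates a conditional query of the sort posed in BLInD.

```python
import random

# Hypothetical Bayesian network A -> B -> C (probabilities are illustrative,
# not taken from BLInD). Query: P(C = true | A = true).
P_A = 0.3                                # P(A = true)
P_B_GIVEN_A = {True: 0.8, False: 0.1}    # P(B = true | A)
P_C_GIVEN_B = {True: 0.9, False: 0.2}    # P(C = true | B)

def sample_once(rng):
    """Forward-sample one joint assignment (a, b, c) from the network."""
    a = rng.random() < P_A
    b = rng.random() < P_B_GIVEN_A[a]
    c = rng.random() < P_C_GIVEN_B[b]
    return a, b, c

def estimate_p_c_given_a(n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(C | A) via rejection sampling:
    keep only samples where A is true, then count how often C is true."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n_samples):
        a, _, c = sample_once(rng)
        if a:
            kept += 1
            hits += c
    return hits / kept

# Exact answer by enumeration over B, for comparison:
# P(C|A) = P(B|A) * P(C|B) + P(not B|A) * P(C|not B)
exact = 0.8 * 0.9 + 0.2 * 0.2  # = 0.76
print(estimate_p_c_given_a(), exact)
```

With 100,000 samples the rejection-sampling estimate lands close to the exact 0.76; the same forward-sample-and-filter pattern extends to any network an LLM could emit as Python code from a BLInD-style problem description.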