Bayesian Combination of Probabilistic Classifiers using Multivariate Normal Mixtures

Authors: Gregor Pirš, Erik Štrumbelj

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluation on several toy and real-world data sets, including a case study on air-pollution forecasting, shows that the method outperforms other methods, while being robust and easy to use. We empirically evaluated our method on several toy and real-world data sets, and compared it to related methods.
Researcher Affiliation | Academia | Gregor Pirš EMAIL, Erik Štrumbelj EMAIL, University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, 1000 Ljubljana, Slovenia
Pseudocode | No | The paper describes the generative model and derives full-conditional distributions for the model variables to construct a Gibbs sampler, but it does not present a clearly labeled pseudocode or algorithm block.
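The row above notes that the paper builds a Gibbs sampler from derived full-conditional distributions but gives no pseudocode. As a generic illustration of that pattern (not the paper's model), a Gibbs sampler cycles through the variables, drawing each from its full conditional given the current values of the others. The bivariate-normal example below is a standard textbook case; all names and parameters here are chosen for illustration only.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter=5000, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Both full conditionals are known in closed form:
        x | y ~ N(rho * y, 1 - rho**2)
        y | x ~ N(rho * x, 1 - rho**2)
    so one Gibbs sweep alternates these two univariate draws.
    """
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)  # conditional standard deviation
    x, y = 0.0, 0.0                 # arbitrary initial state
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)  # draw x from its full conditional
        y = rng.gauss(rho * x, sd)  # draw y from its full conditional
        samples.append((x, y))
    return samples


samples = gibbs_bivariate_normal(0.8)
```

After enough sweeps, the empirical correlation of the draws approaches the target value of 0.8; the paper's sampler follows the same alternating-conditional scheme, only with the full conditionals of its mixture-model variables.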
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is released, nor does it link to a code repository. The license information provided applies to the paper itself, not to code.
Open Datasets | Yes | Two data sets were constructed using the StatLog DNA data set in a way similar to the experiments in Kim and Ghahramani (2012). We evaluate methods from Section 2 on four air-pollution data sets used in Faganeli Pucer et al. (2018). The data were provided by ARSO.
Dataset Splits | Yes | We generated 500 samples for training and 5000 samples for testing. The data set has 5434 samples; we used 2000 for training and 3434 for testing. DNA A: ... 400 of which were used for training and 786 for testing. DNA B: ... 400 of which were used for training and 786 for testing. We used two thirds of observations for training and the rest for testing.
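The quoted splits are given only as resulting counts (e.g. a two-thirds/one-third split for the air-pollution data). A minimal sketch of how such a random split might be reproduced is below; the helper name, the seed, and the use of a uniform random shuffle are all assumptions, since the paper does not describe the splitting procedure itself.

```python
import random


def train_test_split_indices(n, train_frac=2 / 3, seed=42):
    """Shuffle the indices 0..n-1 and cut them into train/test parts.

    The seed and helper name are illustrative; the paper reports only the
    resulting counts, not the splitting mechanism.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(n * train_frac))
    return idx[:n_train], idx[n_train:]


# A two-thirds split of a hypothetical 900-observation data set.
train, test = train_test_split_indices(900)
```

Fixing the seed makes the split reproducible across runs, which is the property a replication of the reported counts would need.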
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as CPU or GPU models or memory.
Software Dependencies | No | The paper discusses models and methods such as the Gibbs sampler, multinomial logistic regression, random forests, and Gaussian processes, but it does not name any software libraries or programming languages with version numbers that would be needed for replication.
Experiment Setup | Yes | For IBCC and its extension, we selected priors that assume each class has the same prior probability. We used the same priors (α0,j = 1) for all confusion matrices, which represents a weak belief that the models are random. For MM we selected vague priors that put prior belief that the mean values of transformed predictions are zero, with large variances. We set the same priors for Bayesian methods over all data sets. For λ, we use uniform priors λi ~ U(0, 10^5), i.i.d. We set the maximum number of mixture components for MM to 15; however, this number falls dynamically depending on the problem, as described in Section 2.1.
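The prior settings quoted above can be collected into a single configuration sketch, which is how a replication might parameterise the experiments. All key names below are invented for illustration, and the value standing in for the MM mean prior's "large variance" is an assumption, since the quote does not quantify it.

```python
# Hedged sketch of the reported prior settings; names are hypothetical.
priors = {
    "ibcc_class_prior": "uniform",      # each class equally probable a priori
    "ibcc_alpha0": 1.0,                 # α0,j = 1 for all confusion-matrix rows
                                        # (weak belief that models are random)
    "mm_mean_prior_mean": 0.0,          # transformed predictions centred at zero
    "mm_mean_prior_var": 1e3,           # stand-in for "large variance" (assumed)
    "lambda_prior_bounds": (0.0, 1e5),  # λi ~ U(0, 10^5), i.i.d.
    "mm_max_components": 15,            # upper bound; shrinks adaptively
}
```

Keeping the same dictionary across all data sets mirrors the paper's statement that identical priors were used for the Bayesian methods throughout.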