A Finite Sample Analysis of the Naive Bayes Classifier

Authors: Daniel Berend, Aryeh Kontorovich

JMLR 2015

Reproducibility Variable Result LLM Response
Research Type Experimental We revisit, from a statistical learning perspective, the classical decision-theoretic problem of weighted expert voting. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Naive Bayes weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. We derive optimality results for our estimates and also establish some structural characterizations. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. Several challenging open problems are posed, and experimental results are provided to illustrate the theory.
Researcher Affiliation Academia Daniel Berend EMAIL Department of Computer Science and Department of Mathematics Ben-Gurion University Beer Sheva, Israel Aryeh Kontorovich EMAIL Department of Computer Science Ben-Gurion University Beer Sheva, Israel
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks. It primarily presents mathematical derivations and proofs.
Open Source Code No The paper does not provide concrete access to source code for the methodology described. There is no mention of a repository link, an explicit code release statement, or code being available in supplementary materials.
Open Datasets No In each trial, a vector of expert competences p ∈ [0, 1]^n was drawn independently componentwise, with p_i ∼ Beta(1, 1). These values (i.e., α_i = β_i = 1) were used for f̂^Ba. The results are reported in Figure 2. The paper describes a generative process for the data used in experiments (drawing expert competences from Beta distributions) rather than using a pre-existing public dataset, and no access information is provided for the generated data.
Dataset Splits No The paper describes generating data for its experiments (e.g., "In each trial, a vector of expert competences p ∈ [0, 1]^n was drawn independently componentwise, with p_i ∼ Beta(1, 1)."). However, it does not provide specific dataset split information (e.g., train/test/validation percentages or counts) as would be applicable for pre-existing datasets. It mentions "results were averaged over 10^5 trials," which is a simulation setting, not a dataset split.
Hardware Specification No The paper describes the experimental setup in Section 6, including the number of experts (n=5), sample sizes (mi), and number of trials (10^5). However, it does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to conduct these experiments.
Software Dependencies No The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., Python, PyTorch, or specific solvers).
Experiment Setup Yes In our experiments, we set n = 5 and the sample sizes m_i were identical for all experts. The results were averaged over 10^5 trials. For f̂^Ba, we used α_i = β_i = 1 for all i.
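The experimental setup described above (competences p_i drawn Beta(1, 1), n = 5 experts, results averaged over many trials) can be sketched as a small simulation. This is a hedged illustration, not the authors' code: it assumes the standard Nitzan–Paroush weighted-majority rule with known competences, where expert i receives weight w_i = log(p_i / (1 − p_i)); the function and variable names are ours, and the trial count is reduced from the paper's 10^5 for speed.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_majority_trial(n=5, rng=rng):
    """One simulated trial: draw competences, collect votes, decide."""
    # Draw expert competences p_i ~ Beta(1, 1), i.e. uniform on [0, 1],
    # as in the paper's experimental setup.
    p = rng.beta(1.0, 1.0, size=n)
    # The true label is taken as +1; expert i votes correctly with probability p_i.
    votes = np.where(rng.random(n) < p, 1.0, -1.0)
    # Nitzan-Paroush optimal weights for known competences:
    # w_i = log(p_i / (1 - p_i)); a negative weight flips an unreliable expert.
    eps = 1e-12  # guard against log(0) at the boundary
    w = np.log((p + eps) / (1.0 - p + eps))
    # Weighted majority decision; the trial succeeds if it recovers +1.
    return np.sign(w @ votes) == 1.0

trials = 10_000  # the paper averages over 10^5 trials
accuracy = np.mean([weighted_majority_trial() for _ in range(trials)])
print(f"empirical accuracy of the optimal rule: {accuracy:.3f}")
```

Replacing the known p_i here with empirical estimates from m_i samples per expert would reproduce the frequentist setting the paper analyzes; the Bayesian rule f̂^Ba instead uses the Beta(α_i = 1, β_i = 1) prior quoted above.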