Herded Gibbs Sampling

Authors: Yutian Chen, Luke Bornn, Nando de Freitas, Mareija Eskelin, Jing Fang, Max Welling

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Herded Gibbs is shown to outperform Gibbs on image denoising with MRFs and named entity recognition with CRFs; however, convergence of herded Gibbs for sparsely connected probabilistic graphical models remains an open problem. The paper illustrates the performance of herded Gibbs on two synthetic examples and two real experiments, image denoising and natural language processing.
Researcher Affiliation | Collaboration | Yutian Chen (EMAIL): 7 Pancras Square, Kings Cross, London, N1C 4AG, United Kingdom; Luke Bornn (EMAIL): Statistics & Actuarial Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada; Nando de Freitas (EMAIL): 7 Pancras Square, Kings Cross, London, N1C 4AG, United Kingdom; Mareija Eskelin (EMAIL): Dept. of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, BC, V6T 1Z4, Canada; Jing Fang (EMAIL): 1 Facebook Way, Menlo Park, CA, 94025, United States; Max Welling (EMAIL): Informatics Institute, Science Park 904, Postbus 94323, 1090 GH, Amsterdam, Netherlands
Pseudocode | Yes | Algorithm 1: Herded Gibbs Sampling
Open Source Code | Yes | Code is available at http://www.mareija.ca/research/code/
Open Datasets | Yes | The paper uses the pre-trained 3-class CRF model in the Stanford NER package (J. R. Finkel and Manning, 2005); for the test set, it uses the corpus from the NIST 1999 IE-ER Evaluation.
Dataset Splits | No | The paper mentions a 'test set' for NER (the NIST 1999 IE-ER Evaluation) and '10 corrupted images' for image denoising, but does not provide specific training/validation splits or other detailed partitioning methodology needed for reproducibility.
Hardware Specification | No | The paper does not report the hardware (e.g., GPU/CPU models, memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions software such as the Stanford NER package but does not specify version numbers for any key software components or libraries.
Experiment Setup | Yes | The L2 reconstruction errors as a function of the number of iterations are shown in Figure 9, which compares herded Gibbs against Gibbs and two versions of mean field with different damping factors (Murphy, 2012). Table 2 reports the errors of the image denoising example after 30 iterations (all measurements scaled by 10^-3). Table 3 reports F1 scores for Gibbs, herded Gibbs, and Viterbi on the NER task, along with the average computational time each approach took to run inference over the entire test set (in square brackets). After only 400 iterations (90.48 seconds), herded Gibbs already achieves an F1 score of 84.75, while Gibbs, even after 800 iterations (115.92 seconds), only achieves an F1 score of 84.61.
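The Pseudocode row above refers to the paper's Algorithm 1 (herded Gibbs sampling). As a hedged illustration of the herding idea behind it, the sketch below runs deterministic Gibbs-style sweeps over a toy two-variable binary MRF: one herding weight is kept per (variable, conditioning state), each visit adds the conditional probability to that weight, and a 1 is emitted whenever the weight crosses 1, with the remainder carried forward. The model, parameter values (`coupling`, `bias`), and function names are assumptions made for this demo, not the authors' released code.

```python
import itertools
import math


def conditional_p1(i, x, coupling=0.8, bias=0.1):
    """p(x_i = 1 | x_other) for a toy two-node binary MRF.

    Assumed model for the demo: unnormalized potentials
    exp(bias*x0 + bias*x1 + coupling*x0*x1).
    """
    other = x[1 - i]
    e1 = math.exp(bias + coupling * other)
    return e1 / (1.0 + e1)


def herded_gibbs(n_sweeps=5000):
    """Deterministic herded-Gibbs-style sweeps over two binary variables.

    Returns the empirical joint distribution over the four states.
    """
    x = [0, 0]
    w = {}  # (variable index, state of the other variable) -> herding weight
    counts = {s: 0 for s in itertools.product([0, 1], repeat=2)}
    for _ in range(n_sweeps):
        for i in (0, 1):
            key = (i, x[1 - i])
            w[key] = w.get(key, 0.0) + conditional_p1(i, x)
            if w[key] >= 1.0:   # deterministic "sample": emit 1, keep the carry
                x[i] = 1
                w[key] -= 1.0
            else:
                x[i] = 0
        counts[tuple(x)] += 1
    return {s: c / n_sweeps for s, c in counts.items()}
```

On this tiny fully connected model, the empirical state frequencies should approach the true joint distribution as the number of sweeps grows; the deterministic weight update replaces the random draw of ordinary Gibbs sampling.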