On Markov chain Monte Carlo methods for tall data
Authors: Rémi Bardenet, Arnaud Doucet, Chris Holmes
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper includes a dedicated "Experiments" section (Section 8) where methods are applied to "Logistic regression" and "Gamma linear regression" using datasets like "covtype". It presents numerous figures displaying "Chain histograms", "Autocorr." (autocorrelations), and "Num. lhd. evals" (number of likelihood evaluations) for various algorithms. This clearly indicates empirical evaluation and data analysis. |
| Researcher Affiliation | Academia | All listed authors are affiliated with academic institutions: Rémi Bardenet is at "Université de Lille, CNRS, Centrale Lille, Inria, UMR 9189 CRIStAL", and Arnaud Doucet and Chris Holmes are in the "Department of Statistics, University of Oxford". The listed email addresses (EMAIL, EMAIL, EMAIL) also use academic or related domains. |
| Pseudocode | Yes | The paper contains two explicitly labeled pseudocode blocks: "Figure 1: Pseudocode of the MH algorithm targeting the distribution π." and "Figure 9: Pseudocode of the confidence MH from (Bardenet et al., 2014).". These figures present structured algorithmic steps. |
| Open Source Code | Yes | The introduction states: "All examples can be rerun or modified using the companion IPython notebook to the paper", with a footnote giving the repository URL: https://github.com/rbardenet/2017-JMLR-MCMCForTallData. This is a direct link to a GitHub repository containing the code. |
| Open Datasets | Yes | In Section 8.1.3, the paper states: "We consider the dataset covtype.binary described in Collobert et al. (2002). The dataset consists of 581,012 points, of which we pick n = 400,000 as a training set, following the maximum training size in Collobert et al. (2002)." A footnote gives the download location: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. This provides a specific dataset name, a citation, and a direct URL for access. |
| Dataset Splits | Yes | In Section 8.1.3, for the covtype dataset, the authors state: "The dataset consists of 581,012 points, of which we pick n = 400,000 as a training set, following the maximum training size in Collobert et al. (2002)." This explicitly defines the size of the training set used. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU or GPU models, memory, or cloud instance types used for running the experiments. It only mentions using a "disk-based database using SQLite" for managing data in Section 8. |
| Software Dependencies | No | The paper mentions using "IPython notebook" and "SQLite" in the text. However, it does not provide specific version numbers for these software components, which is required for reproducibility. |
| Experiment Setup | Yes | The paper provides several specific experimental setup details, including: "The MH proposal is an isotropic Gaussian random walk, whose stepsize is first set proportional to 1/√n and then adapted during the first 1,000 iterations so as to reach 50% acceptance." (Section 2.3); "The stepsize ϵ_k is chosen proportional to k^{-1/3}, following the recommendations of Teh et al. (2016)." (Section 5.1); and "The parameters are ϵ = 0.05, corresponding to the p-value threshold in the aforementioned T-test, and an initial subsample size of 100 at each iteration." (Section 6.2.2). For the improved confidence sampler, it notes setting δ = 0.1 and running 5 independent chains for 10,000 iterations, dropping proxies every 10 iterations (Sections 8.1.3 and 8.2.2). |
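The MH setup quoted above (an isotropic Gaussian random-walk proposal whose stepsize is adapted during an initial burn-in toward 50% acceptance) can be sketched as follows. The 1-D standard-normal target and the Robbins-Monro adaptation rule below are illustrative assumptions, not the paper's tall-data posterior or its exact adaptation scheme:

```python
import math
import random

def log_target(theta):
    # Illustrative 1-D target: standard normal log-density, standing in
    # for log pi(theta). The paper's targets are tall-data posteriors.
    return -0.5 * theta * theta

def rw_metropolis(n_iters, step, adapt_iters=1000, target_accept=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a Gaussian proposal whose
    stepsize is adapted over the first `adapt_iters` iterations so the
    empirical acceptance rate approaches `target_accept` (50%, as in the
    quoted setup). The multiplicative Robbins-Monro update is a common
    choice, assumed here for illustration."""
    rng = random.Random(seed)
    theta = 0.0
    lp = log_target(theta)
    chain, accepts = [], 0
    for k in range(1, n_iters + 1):
        prop = theta + step * rng.gauss(0.0, 1.0)   # isotropic Gaussian step
        lp_prop = log_target(prop)
        if math.log(rng.random()) < lp_prop - lp:   # MH accept/reject
            theta, lp = prop, lp_prop
            accepts += 1
        if k <= adapt_iters:                        # adapt only during burn-in
            rate = accepts / k
            step *= math.exp((rate - target_accept) / math.sqrt(k))
        chain.append(theta)
    return chain, step, accepts / n_iters

chain, final_step, acc_rate = rw_metropolis(5000, step=1.0)
```

After adaptation the stepsize is frozen, so the later part of the chain is a standard (non-adaptive) MH run; this matches the quoted design of tuning only during the first iterations.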