Ancestral Gumbel-Top-k Sampling for Sampling Without Replacement
Authors: Wouter Kool, Herke van Hoof, Max Welling
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents the experiments and results. In our first experiment, we analyze different methods for sampling without replacement: (1) Ancestral Gumbel-Top-k sampling (Section 3), where we experiment with different values of m to control the parallelizability of the algorithm; (2) rejection sampling (Section 4.4), which generates samples with replacement (using standard ancestral sampling) sequentially and rejects duplicates; we also implement a parallel version of this, which generates m samples with replacement in parallel, rejects the duplicates, and repeats this procedure until k unique samples are found; (3) naïve ancestral sampling without replacement (Section 4.5), which is inherently sequential, but for which we also implement a naïve parallelizable version similar to rejection sampling. |
| Researcher Affiliation | Collaboration | Wouter Kool EMAIL, University of Amsterdam, P.O. Box 19268, 1000GG, Amsterdam, The Netherlands; ORTEC, Houtsingel 5, 2719EA, Zoetermeer, The Netherlands. Herke van Hoof EMAIL, University of Amsterdam. Max Welling EMAIL, University of Amsterdam, CIFAR. |
| Pseudocode | Yes | Algorithm 1 Ancestral Gumbel Topk Sampling(pθ, k, m) |
| Open Source Code | Yes | 4. Our code is available at https://github.com/wouterkool/stochastic-beam-search. |
| Open Datasets | Yes | We use the pretrained model from Gehring et al. (2017) and use the wmt14.v2.en-fr.newstest2014 test set consisting of 3003 sentences. Footnote 5: Available at https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2. |
| Dataset Splits | Yes | We use the pretrained model from Gehring et al. (2017) and use the wmt14.v2.en-fr.newstest2014 test set consisting of 3003 sentences. |
| Hardware Specification | No | No specific hardware details are provided in the paper. The text mentions "the number of GPUs (or parallel processors on a single GPU)" but does not specify the GPU models, quantity, or other relevant specifications used in the experiments. |
| Software Dependencies | No | The paper mentions using "fairseq (Ott et al., 2019)" but does not provide a specific version number for this software dependency, which is necessary for reproducibility. |
| Experiment Setup | Yes | For Sampling and Stochastic Beam Search, we control the diversity of samples generated using the softmax temperature τ (see Equation 2) used to compute the model probabilities. We use τ = 0.1, 0.2, ..., 0.8, where a higher τ results in higher diversity. Heuristically, we also vary τ for computing the scores with (deterministic) Beam Search. The diversity of Diverse Beam Search is controlled by the diversity strength parameter, which we vary between 0.1, 0.2, ..., 0.8. We set the number of groups G equal to the sample size k, which Vijayakumar et al. (2018) reported as the best choice. ... We use lower temperatures and experiment with τ = 0.05, 0.1, 0.2, 0.5. We then use different methods to estimate the BLEU score: Monte Carlo (MC), using Equation (20); Stochastic Beam Search (SBS), where we compute estimates using the estimator in Equation (21) and the normalized variant in Equation (23); Beam Search (BS), where we compute a deterministic beam S (the temperature τ affects the scoring) and compute Σ_{y∈S} pθ(y|x)f(y). ... for temperatures τ = 0.05, 0.1, 0.2, 0.5 and sample sizes k = 1 to 250. |
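The core idea the paper builds on is the Gumbel-Top-k trick: perturbing each category's log-probability with independent Gumbel(0, 1) noise and taking the k largest perturbed values yields an exact sample of k items without replacement. Below is a minimal non-ancestral sketch of that trick for a flat categorical distribution (the function name and structure are illustrative, not the authors' implementation, which applies the trick incrementally over sequence models):

```python
import numpy as np

def gumbel_top_k(log_probs, k, rng=None):
    """Draw k distinct indices from a categorical distribution given by
    (possibly unnormalized) log-probabilities, via the Gumbel-Top-k trick:
    add independent Gumbel(0, 1) noise and keep the k largest perturbed values.
    """
    rng = np.random.default_rng() if rng is None else rng
    perturbed = np.asarray(log_probs, dtype=float) + rng.gumbel(size=len(log_probs))
    # argpartition selects the top-k in O(n); then sort those k descending
    top_k = np.argpartition(-perturbed, k - 1)[:k]
    return top_k[np.argsort(-perturbed[top_k])]

# Example: sample 3 of 5 categories without replacement
probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
sample = gumbel_top_k(np.log(probs), k=3, rng=np.random.default_rng(0))
```

The returned indices are always distinct, which is what distinguishes this from repeated independent draws and makes the rejection-sampling and naïve-ancestral baselines in the table comparable to it.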